Rensselaer Polytechnic Institute
library(Ecdat)
project1 <- Workinghours
Data were drawn from the 1987 cross section of the Michigan Panel Study of Income Dynamics. The sample consists of married couples with nonnegative total family income, where the wife was of working age (18-64) and not self-employed. The dataset consists of 3382 observations and 12 variables.
A data frame containing the following variables (first six rows shown):
head(project1)
## hours income age education child5 child13 child17 nonwhite owned
## 1 2000 350 26 12 0 1 0 0 1
## 2 390 241 29 8 0 1 1 0 1
## 3 1900 160 33 10 0 2 0 0 1
## 4 0 80 20 9 2 0 0 0 1
## 5 3177 456 33 12 0 2 0 0 1
## 6 0 390 22 12 2 0 0 0 1
## mortgage occupation unemp
## 1 1 swcc 7
## 2 1 other 4
## 3 0 swcc 7
## 4 1 other 7
## 5 1 swcc 7
## 6 1 other 7
The four factors being studied, with the levels of each, are listed below:
owned (convert from integer to factor)
project1$owned <- as.factor(project1$owned)
str(project1$owned)
## Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
nonwhite (convert from integer to factor)
project1$nonwhite <- as.factor(project1$nonwhite)
str(project1$nonwhite)
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
mortgage (convert from integer to factor)
project1$mortgage <- as.factor(project1$mortgage)
str(project1$mortgage)
## Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 2 1 1 ...
occupation (factor w/ 4 levels)
levels(project1$occupation)
## [1] "other" "mp" "swcc" "fr"
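The three conversions above follow the same pattern, so they could also be done in one step. The following is a minimal sketch on a small stand-in data frame; `demo` and its values are hypothetical and are not the Workinghours data, although the column names match those used above.

```r
# Hypothetical stand-in for project1: 0/1 indicators stored as integers.
demo <- data.frame(
  hours    = c(2000L, 390L, 1900L, 0L),
  owned    = c(1L, 1L, 0L, 1L),
  nonwhite = c(0L, 0L, 1L, 0L),
  mortgage = c(1L, 0L, 0L, 1L)
)

# Convert every indicator column to a factor in one pass.
indicator_cols <- c("owned", "nonwhite", "mortgage")
demo[indicator_cols] <- lapply(demo[indicator_cols], as.factor)

# The response stays an integer; the indicators are now two-level factors.
str(demo)
```

Applied to the real data, the same two lines would replace `demo` with `project1`.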
The response variable is hours, the labor supply of married females measured in hours per year.
The full dataset has 3382 observations and 12 variables. A new dataset is created by keeping only the response variable and the 4 factors discussed above.
myvars <- c("hours", "owned", "nonwhite", "mortgage", "occupation")
newdata <- project1[myvars]
head(newdata)
## hours owned nonwhite mortgage occupation
## 1 2000 1 0 1 swcc
## 2 390 1 0 1 other
## 3 1900 1 0 0 swcc
## 4 0 1 0 1 other
## 5 3177 1 0 1 swcc
## 6 0 1 0 1 other
str(newdata)
## 'data.frame': 3382 obs. of 5 variables:
## $ hours : int 2000 390 1900 0 3177 0 0 1040 2040 0 ...
## $ owned : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
## $ nonwhite : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mortgage : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 2 1 1 ...
## $ occupation: Factor w/ 4 levels "other","mp","swcc",..: 3 1 3 1 3 1 3 2 4 1 ...
summary(newdata)
## hours owned nonwhite mortgage occupation
## Min. : 0 0:1079 0:2382 0:1597 other:1314
## 1st Qu.: 0 1:2303 1:1000 1:1785 mp : 962
## Median :1304 swcc :1021
## Mean :1135 fr : 85
## 3rd Qu.:1944
## Max. :5840
This study uses a multi-factor analysis of variance on the subset created in the previous section, with owned, nonwhite, mortgage, and occupation as the factors.
The rationale for this type of design is to determine how much of the variation in hours is attributable to each factor and to the interactions between factors.
There was no randomization, because the dataset is observational: the factor levels were recorded, not assigned.
Each observation is a distinct married couple. There is no mention of any repeated measures or replicates in this study.
No blocking was used in this analysis. Restricting the data to the four factors of interest does not control for the remaining variables (for example age, education, and number of children), so any variation they cause in hours remains in the error term.
Based on these boxplots, all four factors appear to have some effect on the labor supply of married females. The mortgage factor seems to show the greatest difference between levels.
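The code that produced the boxplots is not shown. A minimal sketch of how one boxplot of the response per factor could be drawn is given here. The helper `plot_factor_boxplots` and the simulated data frame `demo` are hypothetical and serve only as an illustration; this is not necessarily how the original plots were made.

```r
# Draw one boxplot of the response per factor and return the boxplot statistics.
plot_factor_boxplots <- function(df, response, factors) {
  op <- par(mfrow = c(2, 2))   # 2 x 2 grid fits four factors
  on.exit(par(op))             # restore the previous layout on exit
  stats <- lapply(factors, function(f) {
    boxplot(reformulate(f, response = response), data = df,
            main = f, xlab = f, ylab = response)
  })
  names(stats) <- factors
  invisible(stats)
}

# Simulated stand-in data (hypothetical values, not the Workinghours data).
set.seed(42)
n <- 200
demo <- data.frame(
  hours      = pmax(0, round(rnorm(n, mean = 1100, sd = 900))),
  owned      = factor(rbinom(n, 1, 0.7)),
  nonwhite   = factor(rbinom(n, 1, 0.3)),
  mortgage   = factor(rbinom(n, 1, 0.5)),
  occupation = factor(sample(c("other", "mp", "swcc", "fr"), n, replace = TRUE),
                      levels = c("other", "mp", "swcc", "fr"))
)

box_stats <- plot_factor_boxplots(
  demo, "hours", c("owned", "nonwhite", "mortgage", "occupation")
)
```

With the real data the call would take `newdata` in place of `demo`. The returned list holds the five-number summaries behind each plot, which makes it possible to compare medians between levels numerically as well as visually.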
In order to determine the statistical significance of the results of the factorial experiment, an ANOVA test will be conducted.
model1 <- aov(hours ~ owned*nonwhite*mortgage*occupation, data = newdata)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## owned 1 1.850e+06 1849655 2.391 0.12213
## nonwhite 1 6.666e+05 666609 0.862 0.35332
## mortgage 1 5.089e+07 50889477 65.785 6.98e-16 ***
## occupation 3 1.751e+07 5835515 7.544 4.99e-05 ***
## owned:nonwhite 1 1.462e+06 1462354 1.890 0.16925
## nonwhite:mortgage 1 1.279e+06 1278941 1.653 0.19860
## owned:occupation 3 1.214e+07 4045207 5.229 0.00134 **
## nonwhite:occupation 3 5.758e+06 1919279 2.481 0.05923 .
## mortgage:occupation 3 3.260e+06 1086695 1.405 0.23943
## owned:nonwhite:occupation 3 9.422e+05 314061 0.406 0.74871
## nonwhite:mortgage:occupation 3 2.769e+05 92304 0.119 0.94878
## Residuals 3358 2.598e+09 773577
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
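The table has no owned:mortgage row, and none of the higher-order terms that contain it, even though the model formula requests every interaction. `aov` leaves a term out of the summary when it cannot be estimated separately from the others. One common cause is an empty cell in the design, for example if no household reports a mortgage without owning its home. Whether that is the case here is an assumption that `table(newdata$owned, newdata$mortgage)` would confirm or rule out. The sketch below reproduces the behaviour on simulated stand-in data; all values are hypothetical.

```r
# Simulated stand-in data: a mortgage is only possible when the home is owned.
set.seed(1)
n <- 200
owned    <- factor(rbinom(n, 1, 0.7))
mortgage <- factor(ifelse(owned == "1", rbinom(n, 1, 0.6), 0), levels = c(0, 1))
hours    <- rnorm(n, mean = 1100, sd = 800)

# The cross-tabulation shows an empty cell: owned = 0, mortgage = 1.
cell_counts <- table(owned, mortgage)
print(cell_counts)

# The interaction column duplicates the mortgage column, so it is aliased
# and the fitted model has rank 3 instead of 4.
fit <- aov(hours ~ owned * mortgage)
print(fit$rank)

# The summary lists owned and mortgage but no owned:mortgage row.
print(summary(fit))
```

Because the design is also unbalanced, the sums of squares in such a table are sequential, so each term is adjusted only for the terms listed before it.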
The summary of the ANOVA gives the p-value for each factor as well as the p-value for each interaction between factors. The null hypothesis states that the variation in the response variable, hours, cannot be explained by anything other than random variation. At an alpha of 0.05, this ANOVA summary does not support the null hypothesis for every term. The p-values for owned, nonwhite, and most of the interactions are not significant, except:
mortgage (p = 6.98e-16)
occupation (p = 4.99e-05)
owned:occupation (p = 0.00134)
For these three terms the p-values are below 0.05, so we reject the null hypothesis and conclude that they explain part of the variation in hours. For the remaining terms the p-values are greater than 0.05, so we fail to reject the null hypothesis; nonwhite:occupation (p = 0.05923) is significant only at the 0.1 level.
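Reading a long ANOVA table by eye is error-prone, so the terms whose p-values fall below alpha can also be listed programmatically. The following is a minimal sketch on simulated stand-in data; `demo`, the factor names `a` and `b`, and the effect size are hypothetical. The `"Pr(>F)"` column name is the one `summary()` returns for an `aov` fit.

```r
# Simulated stand-in data: factor a has a real effect, factor b has none.
set.seed(7)
n <- 300
demo <- data.frame(
  a = factor(sample(c("0", "1"), n, replace = TRUE)),
  b = factor(sample(c("x", "y", "z"), n, replace = TRUE))
)
demo$y <- 1000 + 400 * (demo$a == "1") + rnorm(n, sd = 200)

fit <- aov(y ~ a * b, data = demo)

# summary() returns a list; its first element is the ANOVA table.
tab   <- summary(fit)[[1]]
pvals <- tab[["Pr(>F)"]]          # NA for the Residuals row
alpha <- 0.05

# Row names are padded with spaces, so trim them before reporting.
significant_terms <- trimws(rownames(tab))[!is.na(pvals) & pvals < alpha]
print(significant_terms)
```

For the model in this report the same steps would start from `summary(model1)[[1]]`.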
par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))
The normal Q-Q plot of the residuals shows that the points fall along the line in the middle of the graph but curve away from it at the extremes. This usually means that the data have more extreme values than would be expected if they truly came from a Normal distribution.
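The visual impression of heavy tails can be backed by a number. Sample excess kurtosis is close to 0 for normally distributed values and positive when the tails are heavier. The sketch below defines a hypothetical helper, `excess_kurtosis`, and compares simulated normal and heavy-tailed samples; the simulated values stand in for residuals and are not taken from `model1`.

```r
# Excess kurtosis: fourth standardized moment minus 3 (the Normal value).
# sd() uses the n - 1 denominator, which is a negligible difference at this n.
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

set.seed(123)
normal_res <- rnorm(3000)        # stand-in for well-behaved residuals
heavy_res  <- rt(3000, df = 3)   # stand-in for heavy-tailed residuals

k_normal <- excess_kurtosis(normal_res)
k_heavy  <- excess_kurtosis(heavy_res)
print(c(normal = k_normal, heavy = k_heavy))
```

For the fitted model the corresponding call would be `excess_kurtosis(residuals(model1))`.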
plot(fitted(model1), residuals(model1))
Residual = Observed - Predicted
Therefore…
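The definition of a residual given above can be checked directly against what `residuals()` returns. The following is a minimal sketch on a small simulated one-way layout; `demo` and its values are hypothetical.

```r
# Simulated stand-in data: three groups of ten observations each.
set.seed(1)
demo <- data.frame(g = factor(rep(c("a", "b", "c"), each = 10)))
demo$y <- 10 + 2 * as.integer(demo$g) + rnorm(30)

fit <- aov(y ~ g, data = demo)

# Residual = Observed - Predicted, computed by hand.
manual_res <- demo$y - fitted(fit)

# The hand-computed values agree with residuals(), and they sum to zero
# because the model includes an intercept.
print(all.equal(unname(manual_res), unname(residuals(fit))))
print(sum(residuals(fit)))
```

In a one-way layout the predicted value for each observation is its group mean, so each residual is the distance of an observation from the mean of its own group.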
Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.
Journal of Applied Econometrics, Vol. 10, No. 2 (Apr. - Jun., 1995), pp. 187-200