Rensselaer Polytechnic Institute

1. Setting

library(Ecdat)
project1 <- Workinghours

System Under Test

The data were drawn from the 1987 cross section of the Michigan Panel Study of Income Dynamics. The sample consists of married couples with nonnegative total family income in which the wife was of working age (18-64) and not self-employed. It contains 3382 observations and 12 variables.

A dataframe containing:

  • hours: wife's working hours per year
  • income: the other household income in hundreds of dollars
  • age: age of the wife
  • education: years of education of the wife
  • child5: number of children aged 0 to 5
  • child13: number of children aged 6 to 13
  • child17: number of children aged 14 to 17
  • nonwhite: 0 = white, 1 = other race
  • owned: 1 = home owned by the household, 0 = otherwise
  • mortgage: 1 = home on mortgage, 0 = otherwise
  • occupation: occupation of the husband
  • unemp: local unemployment rate
head(project1)
##   hours income age education child5 child13 child17 nonwhite owned
## 1  2000    350  26        12      0       1       0        0     1
## 2   390    241  29         8      0       1       1        0     1
## 3  1900    160  33        10      0       2       0        0     1
## 4     0     80  20         9      2       0       0        0     1
## 5  3177    456  33        12      0       2       0        0     1
## 6     0    390  22        12      2       0       0        0     1
##   mortgage occupation unemp
## 1        1       swcc     7
## 2        1      other     4
## 3        0       swcc     7
## 4        1      other     7
## 5        1       swcc     7
## 6        1      other     7

Factors and Levels

The four factors being studied, and the levels of each, are listed below:

owned (convert from integer to factor)

project1$owned <- as.factor(project1$owned)
str(project1$owned)
##  Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
  • 0 = Doesn’t own house
  • 1 = Does own house

nonwhite (convert from integer to factor)

project1$nonwhite <- as.factor(project1$nonwhite)
str(project1$nonwhite)
##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
  • 0 = white
  • 1 = nonwhite

mortgage (convert from integer to factor)

project1$mortgage <- as.factor(project1$mortgage)
str(project1$mortgage)
##  Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 2 1 1 ...
  • 0 = non mortgage holder
  • 1 = mortgage holder

occupation (factor w/ 4 levels)

levels(project1$occupation)
## [1] "other" "mp"    "swcc"  "fr"
  • mp = manager or professional
  • swcc = sales, worker, clerical, or craftsman
  • fr = farm-related
  • other = other
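The three indicator conversions above can also be done in one pass, with readable level labels attached at the same time. The following is only a sketch on a made-up data frame: the object `toy`, its values, and the labels "no" and "yes" are assumptions for illustration and are not part of the original analysis.

```r
# Toy data with the same 0/1 coding as the indicator columns (values are made up).
toy <- data.frame(
  owned    = c(1L, 1L, 0L, 0L),
  nonwhite = c(0L, 0L, 1L, 0L),
  mortgage = c(1L, 0L, 0L, 0L)
)

# Convert every indicator column to a factor with labeled levels.
# The labels "no" and "yes" are an assumption chosen for this sketch.
indicator_cols <- c("owned", "nonwhite", "mortgage")
toy[indicator_cols] <- lapply(
  toy[indicator_cols],
  factor, levels = c(0, 1), labels = c("no", "yes")
)

str(toy)
```

Labeled levels make later output such as `summary()` tables and plot axes easier to read than the bare codes 0 and 1.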

Continuous Variables

  • hours: wife's working hours per year
  • income: the other household income in hundreds of dollars
  • education: years of education of the wife

Response Variable

The response variable is hours, the labor supply of married women measured in hours per year.

The Data: How is it organized and what does it look like?

The study consists of 3382 observations and 12 variables. A new data set is created by keeping only the response variable, hours, and the four factors discussed above.

myvars <- c("hours", "owned", "nonwhite", "mortgage", "occupation")
newdata <- project1[myvars]
head(newdata)
##   hours owned nonwhite mortgage occupation
## 1  2000     1        0        1       swcc
## 2   390     1        0        1      other
## 3  1900     1        0        0       swcc
## 4     0     1        0        1      other
## 5  3177     1        0        1       swcc
## 6     0     1        0        1      other
str(newdata)
## 'data.frame':    3382 obs. of  5 variables:
##  $ hours     : int  2000 390 1900 0 3177 0 0 1040 2040 0 ...
##  $ owned     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
##  $ nonwhite  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mortgage  : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 1 2 1 1 ...
##  $ occupation: Factor w/ 4 levels "other","mp","swcc",..: 3 1 3 1 3 1 3 2 4 1 ...
summary(newdata)
##      hours      owned    nonwhite mortgage occupation  
##  Min.   :   0   0:1079   0:2382   0:1597   other:1314  
##  1st Qu.:   0   1:2303   1:1000   1:1785   mp   : 962  
##  Median :1304                              swcc :1021  
##  Mean   :1135                              fr   :  85  
##  3rd Qu.:1944                                          
##  Max.   :5840
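The summary shows that the factor levels are far from balanced (for example, only 85 couples fall in the fr occupation level). Before fitting a full factorial model, it can help to cross-tabulate the factors and look for sparse or empty cells, since an interaction cannot be estimated for a combination that never occurs. The sketch below uses a made-up data frame `toy`; whether the real data contain empty cells is not shown in this report.

```r
# Toy data (made up) using the same factor names as newdata.
toy <- data.frame(
  owned    = factor(c(1, 1, 1, 0, 0, 1), levels = c(0, 1)),
  mortgage = factor(c(1, 1, 0, 0, 0, 1), levels = c(0, 1))
)

# Cross-tabulate two factors to count the observations in each cell.
cell_counts <- table(owned = toy$owned, mortgage = toy$mortgage)
print(cell_counts)

# Locate cells with zero observations; effects involving them
# cannot be estimated from the data.
empty_cells <- which(cell_counts == 0, arr.ind = TRUE)
print(empty_cells)
```

On the real data the same idea would be applied with `table(newdata$owned, newdata$mortgage)` or with `xtabs()` over all four factors.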

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment will use a multi-factor analysis of variance on the subset created in the previous section, with owned, nonwhite, mortgage, and occupation as factors and hours as the response.

What is the rationale for this design?

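The rationale for this design is to partition the variation in hours into variation between the levels of each factor and variation within them, so that the effect of each factor and of the interactions between factors can be assessed.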

Randomize: What is the Randomization Scheme?

There was no randomization, because the data set is observational: the couples were surveyed rather than assigned to factor levels.

Replicate: Are there replicates and/or repeated measures?

The data was generated by selecting unique married couples. There is no mention of any repeated measures or replicates of this study.

Block: Did you use blocking in the design?

No blocking was used in this analysis. The data were only subset to the four factors of interest, so potential nuisance variables such as age, education, and number of children are not controlled for and contribute to the error variance.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Based on these boxplots, all four factors appear to have some effect on the labor supply of married women. The mortgage factor seemed to show the greatest difference between levels.
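The plotting code for the boxplots is not included in this report. A minimal sketch of how such boxplots could be drawn with base R follows; the data frame `toy` and its values are made up, and the titles and axis labels are assumptions.

```r
# Toy stand-in for newdata (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  hours      = round(runif(40, min = 0, max = 3000)),
  mortgage   = factor(rep(c(0, 1), each = 20)),
  occupation = factor(rep(c("other", "mp", "swcc", "fr"), times = 10))
)

# Draw to a null device so the sketch runs without opening a window;
# omit pdf(NULL) and dev.off() in an interactive session.
pdf(NULL)
par(mfrow = c(1, 2))
bp_mortgage <- boxplot(hours ~ mortgage, data = toy,
                       main = "Mortgage", xlab = "mortgage",
                       ylab = "hours per year")
bp_occupation <- boxplot(hours ~ occupation, data = toy,
                         main = "Occupation", xlab = "occupation",
                         ylab = "hours per year")
dev.off()

# boxplot() invisibly returns the plotted summaries, one column per level.
print(bp_mortgage$stats)
```

Replacing `toy` with `newdata` and repeating the call for owned and nonwhite would give one panel per factor.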

Testing

In order to determine the statistical significance of the results of the factorial experiment, an ANOVA test will be conducted.

model1 <- aov(hours ~ owned*nonwhite*mortgage*occupation, data = newdata)
summary(model1)
##                                Df    Sum Sq  Mean Sq F value   Pr(>F)    
## owned                           1 1.850e+06  1849655   2.391  0.12213    
## nonwhite                        1 6.666e+05   666609   0.862  0.35332    
## mortgage                        1 5.089e+07 50889477  65.785 6.98e-16 ***
## occupation                      3 1.751e+07  5835515   7.544 4.99e-05 ***
## owned:nonwhite                  1 1.462e+06  1462354   1.890  0.16925    
## nonwhite:mortgage               1 1.279e+06  1278941   1.653  0.19860    
## owned:occupation                3 1.214e+07  4045207   5.229  0.00134 ** 
## nonwhite:occupation             3 5.758e+06  1919279   2.481  0.05923 .  
## mortgage:occupation             3 3.260e+06  1086695   1.405  0.23943    
## owned:nonwhite:occupation       3 9.422e+05   314061   0.406  0.74871    
## nonwhite:mortgage:occupation    3 2.769e+05    92304   0.119  0.94878    
## Residuals                    3358 2.598e+09   773577                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA summary gives the p-value for each factor and for each estimated interaction between factors. For each term, the null hypothesis states that the term explains none of the variation in the response variable, hours, beyond random variation. At an alpha of 0.05, the summary does not support the null hypothesis for every term. The following terms have significant p-values:

  • mortgage (p = 6.98e-16)
  • occupation (p = 4.99e-05)
  • owned:occupation (p = 0.00134)

For these terms we reject the null hypothesis and conclude that they explain part of the variation in hours. For owned, nonwhite, and the remaining interactions, the p-values are greater than 0.05, so we fail to reject the null hypothesis; nonwhite:occupation (p = 0.05923) is only marginal.
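Comparing each p-value with alpha can also be done programmatically, which avoids misreading a long table. The sketch below fits a small two-factor model to simulated data and lists the terms whose p-value falls below alpha; the data frame `toy`, the simulated effect size, and the object names are assumptions for illustration only.

```r
# Simulated toy data: a real mortgage effect is built in, none for occupation.
set.seed(42)
toy <- data.frame(
  mortgage   = factor(rep(c(0, 1), each = 30)),
  occupation = factor(rep(c("other", "mp", "swcc"), times = 20))
)
toy$hours <- 1000 + 500 * (toy$mortgage == "1") + rnorm(60, sd = 200)

fit <- aov(hours ~ mortgage * occupation, data = toy)

# summary() on an aov fit returns a list whose first element is the ANOVA table.
anova_table <- summary(fit)[[1]]

alpha <- 0.05
p_values <- anova_table[["Pr(>F)"]]
names(p_values) <- trimws(rownames(anova_table))  # row names are space-padded

# Drop the Residuals row (its p-value is NA), then flag significant terms.
p_values <- p_values[!is.na(p_values)]
significant <- names(p_values)[p_values < alpha]
print(significant)
```

Applied to model1, the same steps would list the terms of the four-factor model that are significant at the chosen alpha. Note that `aov()` reports sequential (Type I) sums of squares, so with unbalanced data the p-values depend on the order of the terms in the formula.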

Diagnostics/Model Adequacy Checking

par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))

The normal Q-Q plot of the residuals shows that the points fall along the line in the middle of the graph but curve away from it at the extremes. This usually means that the data have more extreme values than would be expected if they truly came from a normal distribution.
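A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test from base R to the residuals of a small model fitted to simulated, deliberately skewed data; the data frame `toy` and its values are made up, and this test was not part of the original analysis.

```r
# Simulated toy data with a right-skewed response (values are made up).
set.seed(7)
toy <- data.frame(group = factor(rep(c("a", "b"), each = 50)))
toy$y <- rexp(100, rate = 1 / 1000)

fit <- aov(y ~ group, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(fit))
print(sw)
```

`shapiro.test()` accepts between 3 and 5000 values, so it could be applied to the 3382 residuals of model1. With samples that large, even small departures from normality tend to produce small p-values, so the result is best read alongside the Q-Q plot.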

plot(fitted(model1), residuals(model1))

Residual = Observed - Predicted

  • Positive values on the y-axis (residuals) mean that the prediction was too low
  • Negative values mean that the prediction was too high
  • 0 means that the prediction was exactly correct

Therefore…

  • The residuals in the plot above are fairly symmetrically distributed around zero and tend to cluster toward the middle of the plot
  • Overall, there are no clear patterns
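The definition of a residual and the visual check of spread can both be verified numerically. The sketch below, on simulated data, confirms that the residuals returned by R equal observed minus fitted values and then compares the residual standard deviation across groups; the data frame `toy`, the group means, and the object names are assumptions for illustration.

```r
# Simulated toy data: three groups with different means (values are made up).
set.seed(3)
toy <- data.frame(group = factor(rep(c("a", "b", "c"), each = 20)))
toy$y <- c(10, 20, 30)[as.integer(toy$group)] + rnorm(60)

fit <- aov(y ~ group, data = toy)

# Residual = observed - fitted, checked against residuals().
manual_resid <- toy$y - fitted(fit)
print(all.equal(unname(manual_resid), unname(residuals(fit))))

# Residual standard deviation per group: roughly equal values are
# consistent with the constant-variance assumption of ANOVA.
resid_sd <- tapply(residuals(fit), toy$group, sd)
print(resid_sd)
```

For model1, the grouping could be any of the four factors, for example `tapply(residuals(model1), newdata$mortgage, sd)`.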

4. References to the literature

Montgomery, Douglas C. Design and analysis of experiments. John Wiley & Sons, 2008.

Journal of Applied Econometrics, Vol. 10, No. 2 (Apr. - Jun., 1995), pp. 187-200

5. Appendices

https://cran.r-project.org/web/packages/Ecdat/index.html