Project 2: Null Hypothesis Statistical Testing

1. Setting

This study analyzes data of corporate profits by industry and year from 1929-1947, which is an interesting time as it spans WWII. Data are from the Bureau of Economic Analysis and are presented in billions of dollars.

In this experiment, the two factors (categorical independent variables) are industry and year, while the continuous dependent variable is corporate profits. Industry contains twelve different levels: Corporate Profits With Adjustments, Financial, Nonfinancial, Rest Of World, Federal Reserve Banks, Other, Manufacturing, Durable Goods, Nondurable Goods, Transportation And Utilities, and Wholesale Trade Retail Trade Automobile. Year contains 19 levels (each year from 1929-1947).

As the data contain two factors, they are in a 19 x 12 matrix and contain no missing values.

2. Design

This experiment utilizes a randomized block design. One variable, year, will be blocked, and the hypothesis will be tested on the second, instrutry. Exploratory analysis will first be conducted. ANOVA will be conducted along with alternatives to null hypothesis statistical testing. The experiment will culminate with model validation.

Randomization: As there was no control over randomization in data collection or assignment to treatments, the data will be randomized without replacement for analysis.

Replication: For each industry and year, there is a single value for corporate profits. Thus, there are neither replicates nor repeated measures for this experiment.

Blocking: Blocking is used to reduce the variability of a sample. As it is hypothesized that the year will have an effect on corporate profits, year is blocked during the experiment. This will enable elucidation of the role of industry on corporate profits during this timeframe.

3. Statistical Analysis

Exploratory Boxplots

First, exploratory boxplots were conducted to study the individual levels of the factor, industry.

##   CorporateProfitsWithAdjustments DomesticIndustries Financial
## 1                            10.7               10.4       1.6
## 2                             7.4                7.2       0.7
## 3                             2.8                2.8       0.5
##   Nonfinancial RestOfWorld FederalReserveBanks Other Manufacturing
## 1          8.9         0.2                   0   1.6           5.2
## 2          6.6         0.1                   0   0.7           3.9
## 3          2.3         0.0                   0   0.6           1.3
##   DurableGoods NondurableGoods TransportationAndUtilities
## 1          2.6             2.6                        1.9
## 2          1.4             2.5                        1.2
## 3         -0.1             1.4                        0.6
##   WholesaleTradeRetailTradeAutomobile
## 1                                 1.2
## 2                                 1.1
## 3                                 0.5

The effect size was next calculated for the entire dataset using etasquared. ANOVA was done to find the sum of squares effect and the sum of squares total.

A randomized block design was made, using years as the blocked variable and industry as the factor studied.

Industry levels

##  [1] "CorporateProfitsWithAdjustments"    
##  [2] "Financial"                          
##  [3] "Nonfinancial"                       
##  [4] "RestOfWorld"                        
##  [5] "FederalReserveBanks"                
##  [6] "Other"                              
##  [7] "Manufacturing"                      
##  [8] "DurableGoods"                       
##  [9] "NondurableGoods"                    
## [10] "TransportationAndUtilities"         
## [11] "WholesaleTradeRetailTradeAutomobile"

Number of industry levels

## [1] 12

Number of blocks

## [1] 19

ANOVA summary table

THe ANOVA was conducted to utilize the sum of squares values to calculate effect size.

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## level_assignment  18   2195  121.94   4.544 2.91e-08 ***
## block             11    543   49.32   1.838   0.0499 *  
## Residuals        198   5314   26.84                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Effect size = SSeffect/SStotal

## [1] 0.4130598

Type I error is finding a difference between two groups when one does not exist, while Type II error is when a difference exists but the tests do not detect it. For this experiment, I am starting with a type I error, , of 0.05 and a type II error, , of 0.20. The power is the probability of finding a difference when it exists and is calculated by 1-and is thus 0.80. The numerator degrees of freedom (df) is 18 or one less than the sample size, k. The number of groups is 12. The effect size, f, was calculated using the sum of squares for the industry effect and total, resulting in 0.413, (a moderate effect). G*Power was used to determine the critical F and total sample size needed to achieve a power of 0.8. From the F-test family, an ANOVA: Fixed effects, special, main effects, and interactions was utilized, and a post hoc power analysis was used. The resulting critical F was 1.69, and the sample size required to achieve power ~ 0.80 was 133.

From the original n=228 dataset, 133 were randomly selected, and ANOVA was conducted.

New ANOVA summary table using re-sampled data

##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## level_assignment  6  561.5   93.59   5.955 5.15e-05 ***
## block2           11  656.5   59.68   3.797 0.000306 ***
## Residuals        66 1037.2   15.72                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Confidence Intervals

The confidence intervals for each level are as follows:

##  [1] 10.90000000 10.62105263  1.07368421  9.55263158  0.28947368
##  [6]  0.02105263  1.15263158  5.83684211  2.78947368  3.03157895
## [11]  1.57894737  1.64210526

These data are the upper and lower bounds for a 95% confidence interval; it is expected that 95% of sample means (or true values) would fall between these bounds. When the confidence interval does not overlap with zero, it is suggested that the effect is due to more than variance from randomization alone.

Resampling

The F distribution is plotted for the resampled data, resampling 25,000 times. The critical F-value is included as the black point.

##        F 
## 4.352005

## [1] 0

Below are the calculated values of the quantiles of the F distribution. The latter is the bootstrapped quantile. Despite increasing the resampling to 100,000, the bootstrapped value was less that the critical value; thus, 25,000 was used to conserve computation time.

## [1] 1.65341

##      95% 
## 1.607222

Model Accuracy Checks

Fit of profits~industry

The normal QQ-plot for the profits~industry analysis has large deviations as the theoretical quantiles get further from zero. Additionally, the difference in group of the residuals suggests the variance is not equal. Thus, a more appropriate model is required.

Fit of profits~industry+year

The normal QQ-plot for the profits~industry+year analysis is suggestive of a reasonable model for this dataset. However, the distribution of points the residuals vs. fitted graphs are not evenly distributed about zero, suggesting the relationship is not linear. The difference in group of the residuals suggests the variance is not equal. Thus, this is not the best model for the data.

4. References

D. C. Montgomery, Design and Analysis of Experiments, 8th ed. Hoboken, NJ: John Wiley & Sons, Inc., 2013.

5. Appendix

This section contains all code used to analyze this data.

Exploratory Boxplots

#Load in data using Utils package
library(utils)
setwd("C:/Users/Alexis/Documents/R/win-library/3.3")
df <- read.csv("103116_Project2Data.csv", header = TRUE)
df <- df[,2:13]

#Show first 3 rows of data table
head(df, 3)

##   CorporateProfitsWithAdjustments DomesticIndustries Financial
## 1                            10.7               10.4       1.6
## 2                             7.4                7.2       0.7
## 3                             2.8                2.8       0.5
##   Nonfinancial RestOfWorld FederalReserveBanks Other Manufacturing
## 1          8.9         0.2                   0   1.6           5.2
## 2          6.6         0.1                   0   0.7           3.9
## 3          2.3         0.0                   0   0.6           1.3
##   DurableGoods NondurableGoods TransportationAndUtilities
## 1          2.6             2.6                        1.9
## 2          1.4             2.5                        1.2
## 3         -0.1             1.4                        0.6
##   WholesaleTradeRetailTradeAutomobile
## 1                                 1.2
## 2                                 1.1
## 3                                 0.5

#Boxplot examining levels of each industry for exploratory analysis
boxplotdata <- boxplot(df, xlab = "industry", ylab = "corporate profits", main = "industry vs. corporate profits", las=2)

The effect size was next calculated for the entire dataset using . ANOVA was done to find the sum of squares effect and the sum of squares total.

#A randomized block design was made, using years as the blocked variable.
#First, the rows (years) were concatenated in one vector.
vector = c(t(as.matrix(df)))

#industry_levels was assigned as the variable for industry.
industry_levels = c("CorporateProfitsWithAdjustments", "Financial", "Nonfinancial", "RestOfWorld", "FederalReserveBanks", "Other", "Manufacturing", "DurableGoods", "NondurableGoods", "TransportationAndUtilities", "WholesaleTradeRetailTradeAutomobile")
industry_levels

##  [1] "CorporateProfitsWithAdjustments"    
##  [2] "Financial"                          
##  [3] "Nonfinancial"                       
##  [4] "RestOfWorld"                        
##  [5] "FederalReserveBanks"                
##  [6] "Other"                              
##  [7] "Manufacturing"                      
##  [8] "DurableGoods"                       
##  [9] "NondurableGoods"                    
## [10] "TransportationAndUtilities"         
## [11] "WholesaleTradeRetailTradeAutomobile"

#number of industry levels
k = length(1:ncol(df))
k

## [1] 12

#number of blocks
n = length(1:nrow(df))
n

## [1] 19

#A vector of treatment levels was made to establish to which level each element in the vector belongs.
level_assignment = gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

block = gl(k, n, k*n)

ANOVA_result = aov(vector~level_assignment + block)

#ANOVA summary table
summary(ANOVA_result)

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## level_assignment  18   2195  121.94   4.544 2.91e-08 ***
## block             11    543   49.32   1.838   0.0499 *  
## Residuals        198   5314   26.84                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Effect size = SSeffect/SStotal
effectsize = 2195/5314
effectsize

## [1] 0.4130598

From the original n=228 dataset, 133 were randomly selected, and ANOVA was conducted.

#To randomly select 133 pieces of data, 7 were chosen at random from each row and combined to form a new matrix.
dfrand <- df[sample(1:nrow(df), 7, replace=FALSE),]
head(dfrand,3)

##   CorporateProfitsWithAdjustments DomesticIndustries Financial
## 3                             2.8                2.8       0.5
## 1                            10.7               10.4       1.6
## 4                            -0.3               -0.2       0.6
##   Nonfinancial RestOfWorld FederalReserveBanks Other Manufacturing
## 3          2.3         0.0                   0   0.6           1.3
## 1          8.9         0.2                   0   1.6           5.2
## 4         -0.9         0.0                   0   0.7          -0.6
##   DurableGoods NondurableGoods TransportationAndUtilities
## 3         -0.1             1.4                        0.6
## 1          2.6             2.6                        1.9
## 4         -1.1             0.6                        0.2
##   WholesaleTradeRetailTradeAutomobile
## 3                                 0.5
## 1                                 1.2
## 4                                -0.1

#A randomized block design was made, using years as the blocked variable.
#First, the rows (years) were concatenated in one vector.
vector2 = c(t(as.matrix(dfrand)))

#industry_levels was assigned as the variable for industry.
industry_levels = c("CorporateProfitsWithAdjustments", "Financial", "Nonfinancial", "RestOfWorld", "FederalReserveBanks", "Other", "Manufacturing", "DurableGoods", "NondurableGoods", "TransportationAndUtilities", "WholesaleTradeRetailTradeAutomobile")
industry_levels

##  [1] "CorporateProfitsWithAdjustments"    
##  [2] "Financial"                          
##  [3] "Nonfinancial"                       
##  [4] "RestOfWorld"                        
##  [5] "FederalReserveBanks"                
##  [6] "Other"                              
##  [7] "Manufacturing"                      
##  [8] "DurableGoods"                       
##  [9] "NondurableGoods"                    
## [10] "TransportationAndUtilities"         
## [11] "WholesaleTradeRetailTradeAutomobile"

#number of industry levels
k2 = length(1:ncol(dfrand))
k2

## [1] 12

#number of blocks
n2 = length(1:nrow(dfrand))
n2

## [1] 7

#A vector of treatment levels was made to establish to which level each element in the vector belongs.
level_assignment = gl(n2, k2, length = n2*k2, labels = seq_len(n2), ordered = FALSE)

block2 = gl(k2, n2, k2*n2)

ANOVA_result2 = aov(vector2~level_assignment + block2)

#ANOVA summary table
summary(ANOVA_result2)

##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## level_assignment  6  449.3   74.89   6.055 4.32e-05 ***
## block2           11  236.9   21.54   1.742   0.0833 .  
## Residuals        66  816.3   12.37                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Load in data using Utils package
library(utils)
setwd("C:/Users/Alexis/Documents/R/win-library/3.3")
df2 <- read.csv("103116_Project2Data2.csv", header = TRUE)
head(df2,3)

##   year                        industry profits
## 1 1929 CorporateProfitsWithAdjustments    10.7
## 2 1930 CorporateProfitsWithAdjustments     7.4
## 3 1931 CorporateProfitsWithAdjustments     2.8

fit = aov(df2$profits~df2$industry+df2$year)
summary(fit)

##               Df Sum Sq Mean Sq F value Pr(>F)    
## df2$industry  11   3529   320.8   22.46 <2e-16 ***
## df2$year       1   1452  1451.8  101.65 <2e-16 ***
## Residuals    215   3071    14.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(fit)

##                                                             2.5 %
## (Intercept)                                         -1064.6094827
## df2$industry      NondurableGoods                      -2.1746563
## df2$industry    FederalReserveBanks                    -5.1851826
## df2$industry    Manufacturing                           0.6306069
## df2$industry    Other                                  -4.0536036
## df2$industry    TransportationAndUtilities             -3.6272878
## df2$industry    WholesaleTradeRetailTradeAutomobile    -3.5641300
## df2$industry  Financial                                -4.1325510
## df2$industry  Nonfinancial                              4.3463964
## df2$industryCorporateProfitsWithAdjustments             5.6937648
## df2$industryDomesticIndustries                          5.4148174
## df2$industryRestOfWorld                                -4.9167615
## df2$year                                                0.3706344
##                                                            97.5 %
## (Intercept)                                         -715.49156993
## df2$industry      NondurableGoods                      2.65886680
## df2$industry    FederalReserveBanks                   -0.35165952
## df2$industry    Manufacturing                          5.46412995
## df2$industry    Other                                  0.77991943
## df2$industry    TransportationAndUtilities             1.20623522
## df2$industry    WholesaleTradeRetailTradeAutomobile    1.26939311
## df2$industry  Financial                                0.70097206
## df2$industry  Nonfinancial                             9.17991943
## df2$industryCorporateProfitsWithAdjustments           10.52728785
## df2$industryDomesticIndustries                        10.24834048
## df2$industryRestOfWorld                               -0.08323847
## df2$year                                               0.55076914

Estimate the effect size from the main effect of the non-blocking IV. Use G*Power to determine the sample size N N needed to achieve the desired power.

Select, at random, N N experimental cases.

with(df2,tapply(profits,year,mean))

##        1929        1930        1931        1932        1933        1934 
##  3.90833333  2.73333333  1.05833333 -0.09166667 -0.09166667  0.89166667 
##        1935        1936        1937        1938        1939        1940 
##  1.45000000  2.23333333  2.55833333  1.76666667  2.35833333  3.61666667 
##        1941        1942        1943        1944        1945        1946 
##  5.75000000  7.65000000  9.19166667  9.11666667  7.30000000  6.59166667 
##        1947 
##  8.78333333

with(df2,tapply(profits,year,var))

##       1929       1930       1931       1932       1933       1934 
## 15.3935606  7.9606061  1.1281061  0.3335606  0.2717424  0.6881061 
##       1935       1936       1937       1938       1939       1940 
##  1.9572727  4.9551515  6.4608333  2.7460606  5.3044697 13.0760606 
##       1941       1942       1943       1944       1945       1946 
## 35.9681818 64.3772727 94.6008333 93.3578788 60.1945455 44.0317424 
##       1947 
## 79.5815152

with(df2,tapply(profits,year,length))

## 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 
##   12   12   12   12   12   12   12   12   12   12   12   12   12   12   12 
## 1944 1945 1946 1947 
##   12   12   12   12

#
# We need the sd(residual error) from the summary for the Monte Carlo simulation
summary(aov(profits~year,data=df2))

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## year          1   1452  1451.8   49.72 2.14e-11 ***
## Residuals   226   6600    29.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# now the bootstrap version of ANOVA
#
data(df2)

## Warning in data(df2): data set 'df2' not found

meanstar = with(df2, tapply(profits,year,mean))
grpA = df2$profits[df2$year=="1929"] - meanstar[1]
grpB = df2$profits[df2$year=="1930"] - meanstar[2]
grpC = df2$profits[df2$year=="1931"] - meanstar[3]
grpD = df2$profits[df2$year=="1932"] - meanstar[4]
grpE = df2$profits[df2$year=="1933"] - meanstar[5]
grpF = df2$profits[df2$year=="1934"] - meanstar[6]
grpG = df2$profits[df2$year=="1935"] - meanstar[7]
grpH = df2$profits[df2$year=="1936"] - meanstar[8]
grpI = df2$profits[df2$year=="1937"] - meanstar[9]
grpJ = df2$profits[df2$year=="1938"] - meanstar[10]
grpK = df2$profits[df2$year=="1939"] - meanstar[11]
grpL = df2$profits[df2$year=="1940"] - meanstar[12]
grpM = df2$profits[df2$year=="1941"] - meanstar[13]
grpN = df2$profits[df2$year=="1942"] - meanstar[14]
grpO = df2$profits[df2$year=="1943"] - meanstar[15]
grpP = df2$profits[df2$year=="1944"] - meanstar[16]
grpQ = df2$profits[df2$year=="1945"] - meanstar[17]
grpR = df2$profits[df2$year=="1946"] - meanstar[18]
grpS = df2$profits[df2$year=="1947"] - meanstar[19]
simyears = df2$year
R = 25000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=12, replace=T)
  groupB = sample(grpB, size=12, replace=T)
  groupC = sample(grpC, size=12, replace=T)
  groupD = sample(grpD, size=12, replace=T)
  groupE = sample(grpE, size=12, replace=T)
  groupF = sample(grpF, size=12, replace=T)
  groupG = sample(grpG, size=12, replace=T)
  groupH = sample(grpH, size=12, replace=T)
  groupI = sample(grpI, size=12, replace=T)
  groupJ = sample(grpJ, size=12, replace=T)
  groupK = sample(grpK, size=12, replace=T)
  groupL = sample(grpL, size=12, replace=T)
  groupM = sample(grpM, size=12, replace=T)
  groupN = sample(grpN, size=12, replace=T)
  groupO = sample(grpO, size=12, replace=T)
  groupP = sample(grpP, size=12, replace=T)
  groupQ = sample(grpQ, size=12, replace=T)
  groupR = sample(grpR, size=12, replace=T)
  groupS = sample(grpS, size=12, replace=T)
  simprofits = c(groupA,groupB,groupC,groupD,groupE,groupF,groupG,groupH,groupI,groupJ,groupK,groupL,groupM,groupN,groupO,groupP,groupQ,groupR,groupS)
  simdata = data.frame(simprofits,simyears)
  Fstar[i] = oneway.test(simprofits~simyears, var.equal=T, data=simdata)$statistic
}

# Now generate a similar plot, comparing bootstrapped distribution to the known F distribution
hist(Fstar, ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,5,90),type="b",col="red")  # The F distribution


# Here is the alpha level from the bootstrapped distribution
print(realFstar<-oneway.test(profits~year, var.equal=T, data=df2)$statistic)

##        F 
## 4.352005

mean(Fstar>=realFstar)

## [1] 0

points(abs(realFstar),0,pch=16)

# estimate alpha = 0.05 value of the test statistic, given F* from empirical distribution above
# generate quantiles of the analytic F distribution
qf(.95,18,209)

## [1] 1.65341

# alpha = 0.05 value of bootstrapped F* is output here
quantile(Fstar,.95)

##      95% 
## 1.612152

fit2 = aov(df2$profits~df2$industry+df2$year)
plot(fit2, which = c(1:2), caption = list("Residuals vs Fitted", "Normal Q-Q"))

fit3 = aov(df2$profits~df2$industry)
plot(fit3, which = c(1:2), caption = list("Residuals vs Fitted", "Normal Q-Q"))

Project 2: Null Hypothesis Statistical Testing

Alexis Ziemba, RPI

October 31, 2016

1. Setting

2. Design

3. Statistical Analysis

Exploratory Boxplots

ANOVA summary table

Confidence Intervals

Resampling

Model Accuracy Checks

4. References

5. Appendix

Exploratory Boxplots