Recipe 6: Analysis of Covariance"

Recipes for the Design of Experiments

Fortune 500

Jane Braun

RPI

October 30th Version 1.0

1. Setting

System under test

The data used in this recipe was found using the “100+ interesting data sets” webpage.

This dataset has information from 79 companies which were selected as part of the Forbes 500 list in 1986. The comprehensive list of Forbes 500 includes more than 800 companies, because Forbes selects the top 500 in any of its selection criteria.

# Read in dataset from URL
Forbes500 <- read.delim("C:\\Users\\braunj6\\AppData\\Local\\Temp\\RtmpaC9pEu\\data1e2419cc593a")

#Rename for simplicity
x <- Forbes500

# Recategorize data as numeric from integers
x$Assets <- as.numeric(x$Assets)
x$Sales <- as.numeric(x$Sale)
x$Market_Value <- as.numeric(x$Market_Value)

Factors and Levels

The data being examined has 79 observations of 8 variables.

head(x)
##                     Company Assets Sales Market_Value Profits Cash_Flow
## 1              Air Products   2687  1870         1890   145.7     352.2
## 2             Allied Signal  13271  9115         8190  -279.0      83.0
## 3   American Electric Power  13621  4848         4572   485.0     898.9
## 4 American Savings Bank FSB   3614   367           90    14.1      24.6
## 5                       AMR   6425  6131         2448   345.8     682.5
## 6            Apple Computer   1022  1754         1370    72.0     119.5
##   Employees         sector
## 1      18.2          Other
## 2     143.8          Other
## 3      23.4         Energy
## 4       1.1        Finance
## 5      49.5 Transportation
## 6       4.8         HiTech
summary(x)
##                       Company       Assets          Sales        
##  AH Robins                : 1   Min.   :  223   Min.   :  176.0  
##  Air Products             : 1   1st Qu.: 1122   1st Qu.:  815.5  
##  Allied Signal            : 1   Median : 2788   Median : 1754.0  
##  American Electric Power  : 1   Mean   : 5941   Mean   : 4178.3  
##  American Savings Bank FSB: 1   3rd Qu.: 5802   3rd Qu.: 4563.5  
##  AMR                      : 1   Max.   :52634   Max.   :50056.0  
##  (Other)                  :73                                    
##   Market_Value        Profits         Cash_Flow         Employees     
##  Min.   :   53.0   Min.   :-771.5   Min.   :-651.90   Min.   :  0.60  
##  1st Qu.:  512.5   1st Qu.:  39.0   1st Qu.:  75.15   1st Qu.:  3.95  
##  Median :  944.0   Median :  70.5   Median : 133.30   Median : 15.40  
##  Mean   : 3269.8   Mean   : 209.8   Mean   : 400.93   Mean   : 37.60  
##  3rd Qu.: 1961.5   3rd Qu.: 188.1   3rd Qu.: 328.85   3rd Qu.: 48.50  
##  Max.   :95697.0   Max.   :6555.0   Max.   :9874.00   Max.   :400.20  
##                                                                       
##            sector  
##  Finance      :17  
##  Energy       :15  
##  Manufacturing:10  
##  Retail       :10  
##  HiTech       : 8  
##  Other        : 7  
##  (Other)      :12
# As seen below, there are 79 observations of 8 variables, including: 2 factors and 5 numeric variables.
str(x)
## 'data.frame':    79 obs. of  8 variables:
##  $ Company     : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Assets      : num  2687 13271 13621 3614 6425 ...
##  $ Sales       : num  1870 9115 4848 367 6131 ...
##  $ Market_Value: num  1890 8190 4572 90 2448 ...
##  $ Profits     : num  145.7 -279 485 14.1 345.8 ...
##  $ Cash_Flow   : num  352.2 83 898.9 24.6 682.5 ...
##  $ Employees   : num  18.2 143.8 23.4 1.1 49.5 ...
##  $ sector      : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...

Continuous variables

The continuous variables in this dataset are assets, sales, market value, profits and cash flow, with each of those variables being in millions of dollars.

Response variables

The response variable being analyzed in this dataset is the amount of assets each company has.

summary(x$Assets)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     223    1122    2788    5941    5802   52630

The Data

The data is organized into 8 columns. There are 2 factors and 6 variables.

Randomization

There was no randomization in this recipe because of the data points are based on observations about the Forbes 500 companies.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will use an analysis of covariance (ANCOVA). It will use a single factor - market sector - which has 9 levels. It will also use multiple independent explanatory variables, such as sales, market value, profits, and cashflows, in order to predict the response variable - assets.

What is the rationale for this design?

Analysis of Covarariance is a blend of ANOVA and regression, and is used to analyze the effect of a categorical variable on a contiuous dependent variable. An ANCOVA controls for the effects of other continuous variables, which vary along with the dependent variable. These continuous control variables are the covariates. In this design, the categorial factor is market sector, the continous dependent (response) variable in assets, and the covariates are ales, market value, profits, and cashflows. An ANCOVA is used because we want to see the effect that market sector has on amount of assets, but we need to control for the other continuous variables, which cannotet be randomized, but can be measured.

Randomize:

The experiment was based on observations, and was therefore not randomized.

Replicate:

The experiment had no repeated measures.

Block:

There was no blocking in this recipe.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

attach(x)

The plot below compares assets, sales, and market value as dependent and independent variables. The graphs horizonal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the top-middle placement shows a generally positive correlation between sales and assets, with assets as the response (independent) variable. Because “assets” is the response variable of interest, the top row of graphs is of most interest.

plot(x[,c(2,3,4)], main = "Cross Comparison of Assets, Sales, and Market Value (in $M)")  

As before, the plot below compares three different continuous variables, but this time, assets, sales, and market value were evaluated as dependent and independent variables. The graphs horizonal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the top-middle placement shows a generally positive correlation between sales and assets, with assets as the response (independent) variable.

plot(x[,c(3,4,5)], main = "Cross Comparison of Sales, Market Value, and Profits (in $M)") 

The plot below shows the relationship between sales and assets. As stated before, it appears that there is a generally positive correlation between sales and assets, with assets as the response variable.

plot(Assets~Sales, main = "Assets by Sales", xlab = "Sales (in million dollars)", ylab = "Assets (in million dollars)")  

The relationship between sales and profits seems to have neither a positive nor negative correlation. With the exception of two outliers, the relationship seems relatively horizontal.

plot(Profits~Sales, main = "Profits by Sales", xlab = "Sales (in million dollars)", ylab = "Profits (in million dollars)")

Similarly to the last graph, assets does not seem to predict profits. As assets increase, there are both increased and decreased profits.

plot(Profits~Assets, main = "Profits by Assets", xlab = "Assets (in million dollars)", ylab = "Profits (in million dollars)")

The boxplot below shows the relationship between the various sectors and their associated profits. Hi-tech and communication both have the largest ranges, and slightly greater medians than the other sectors.

boxplot(Profits~sector, main = "Profits by Sector", xlab = "Market Sector", ylab = "Profits (in million dollars)")

The boxplot below shows the relationship between the various market sectors and their associated assets. The communication sector has the greatest median assets. The hi-tech sector has the greatest range in assets. Medical seems to have the least median assets

boxplot(Assets~sector, main = "Assets by Sector", xlab = "Market Sector", ylab = "Assets (in million dollars)")  

detach(x)

Testing

Analysis of CoVariance

The focus of this recipe was on a single-factor, multiple dependent variable analysis of covariance.

#ANCOVA Testing

#Comparing the assets of various market sectors, given the profits, cash flows, and market values of those groups. 
# H0: There is no difference between the mean assets of the various sectors. The variation in assets is not caused by anything other than randomization.
# Ha: The difference in variation of assets can caused by something other than randomization. 

model1 <- lm(Assets~Profits + Cash_Flow + Market_Value + sector , data = x)
anova(model1)
## Analysis of Variance Table
## 
## Response: Assets
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Profits       1 2367764664 2367764664 80.3273 4.466e-13 ***
## Cash_Flow     1  618926766  618926766 20.9973 2.061e-05 ***
## Market_Value  1  632445688  632445688 21.4560 1.719e-05 ***
## sector        8  945981531  118247691  4.0116 0.0006167 ***
## Residuals    67 1974923534   29476471                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary of the ANCOVA gives the p-values for each factor and variable. The null hypothesis states that the variation in the response variable, assets (in millions of dollars), cannot be explained by anything other than randomization. The p-values in this ANCOVA support that this is not the case. The p-values for profits, cash flow, market value, and sector are less than an alpha of 0.05. Therefore, it is likely that the variation in assets can be explained by these factors.

Tukey’s Range Tests

Tukey’s range tests are used alongside ANCOVAs in order to find means that are significantly different from each other, by compairing pairs of means. This test compares the mean of each treatment level to the mean of the other treatment levels.

The null hypothesis for a Tukey test states that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means. In this Tukey test, the difference between the means of pairs of sectors were compared.

tukey1 <- TukeyHSD(aov(Assets ~ sector, data = x))
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Assets ~ sector, data = x)
## 
## $sector
##                                     diff        lwr       upr     p adj
## Energy-Communication          -8347.0000 -29525.014 12831.014 0.9389272
## Finance-Communication         -2100.7647 -23131.707 18930.178 0.9999964
## HiTech-Communication           -228.6250 -22469.988 22012.738 1.0000000
## Manufacturing-Communication   -8123.8000 -29915.796 13668.196 0.9553649
## Medical-Communication        -10557.0000 -34921.192 13807.192 0.8990303
## Other-Communication           -9336.5714 -31893.430 13220.288 0.9204048
## Retail-Communication         -10147.1000 -31939.096 11644.896 0.8562152
## Transportation-Communication  -9951.1667 -32921.948 13019.614 0.8991371
## Finance-Energy                 6246.2353  -3719.889 16212.359 0.5444965
## HiTech-Energy                  8118.3750  -4198.334 20435.084 0.4752950
## Manufacturing-Energy            223.2000 -11262.190 11708.590 1.0000000
## Medical-Energy                -2210.0000 -18041.518 13621.518 0.9999517
## Other-Energy                   -989.5714 -13867.265 11888.122 0.9999995
## Retail-Energy                 -1800.1000 -13285.490  9685.290 0.9998836
## Transportation-Energy         -1604.1667 -15193.864 11985.531 0.9999867
## HiTech-Finance                 1872.1397 -10189.933 13934.212 0.9998919
## Manufacturing-Finance         -6023.0353 -17234.925  5188.854 0.7326352
## Medical-Finance               -8456.2353 -24090.468  7177.998 0.7254055
## Other-Finance                 -7235.8067 -19870.175  5398.561 0.6603990
## Retail-Finance                -8046.3353 -19258.225  3165.554 0.3580328
## Transportation-Finance        -7850.4020 -21209.749  5508.945 0.6287802
## Manufacturing-HiTech          -7895.1750 -21239.993  5449.643 0.6202368
## Medical-HiTech               -10328.3750 -27556.461  6899.711 0.6033144
## Other-HiTech                  -9107.9464 -23668.336  5452.443 0.5471248
## Retail-HiTech                 -9918.4750 -23263.293  3426.343 0.3116804
## Transportation-HiTech         -9722.5417 -24916.285  5471.202 0.5161836
## Medical-Manufacturing         -2433.2000 -19077.112 14210.712 0.9999313
## Other-Manufacturing           -1212.7714 -15077.036 12651.493 0.9999987
## Retail-Manufacturing          -2023.3000 -14604.915 10558.315 0.9998587
## Transportation-Manufacturing  -1827.3667 -16355.364 12700.631 0.9999783
## Other-Medical                  1220.4286 -16413.079 18853.936 0.9999998
## Retail-Medical                  409.9000 -16234.012 17053.812 1.0000000
## Transportation-Medical          605.8333 -17554.163 18765.830 1.0000000
## Retail-Other                   -810.5286 -14674.793 13053.736 0.9999999
## Transportation-Other           -614.5952 -16266.544 15037.354 1.0000000
## Transportation-Retail           195.9333 -14332.064 14723.931 1.0000000
plot(tukey1)

The results of the Tukey Test display all p-values greater than an alpha, 0.05 between all pairs of sectors. This leads us to fail to reject the null. Therefore, there is some probability that there is no significant difference between the mean assets of the pairs of sectors. The plot of the Tukey also supports this claim. The 95% confidence intervals of the pair’s difference still include 0 as a difference in mean levels. Therefore, “no difference” can not be not included from the possibilities.

Diagnostics/Model Adequacy Checking

Assumptions for ANCOVA tests include:
* the model must include at least one categorical and at least one continuous independ variable - in our model, we have one categorical, and three independent continuous variables.
* The dependent variable must in continuous - “Assets” is a continuous variable
* Normality of the response variable and residuals - normality will be evaluated below:

par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))

The Q-Q Normality Plots of the residuals for this model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.

plot(fitted(model1),residuals(model1))

The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do not appear to be symmetric over the length of the dynamic range, and are not symmetric over 0. There is a large cluster on the lower part of the range.

shapiro.test(x$Assets)
## 
##  Shapiro-Wilk normality test
## 
## data:  x$Assets
## W = 0.5848, p-value = 1.314e-13

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the assets data is not normally distributed.

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.

A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.

kruskal.test(Assets ~ sector, data = x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Assets by sector
## Kruskal-Wallis chi-squared = 16.0199, df = 8, p-value = 0.0421

Unfortunately, with the kruskal-wallis test, we could only analyze how the various sector groups’ mean varied. The p-value of this Kruskal-Wallis test is just barely less than an alpha of 0.05. Therefore, you must reject the null - that variation in assets cannot be explained by anything other than randomization. Therefore, variation can be explained, in part, market sector.

Especially in financial data, data transformation can be used to create a better fit for analysis.

# The first transformation performed was done by taking the log of the response variable - assets.

x1 <- cbind(x, log(x$Assets)) # add transformed variable as a column
str(x1) # structure of new dataframe with transformed response variable. 
## 'data.frame':    79 obs. of  9 variables:
##  $ Company      : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Assets       : num  2687 13271 13621 3614 6425 ...
##  $ Sales        : num  1870 9115 4848 367 6131 ...
##  $ Market_Value : num  1890 8190 4572 90 2448 ...
##  $ Profits      : num  145.7 -279 485 14.1 345.8 ...
##  $ Cash_Flow    : num  352.2 83 898.9 24.6 682.5 ...
##  $ Employees    : num  18.2 143.8 23.4 1.1 49.5 ...
##  $ sector       : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...
##  $ log(x$Assets): num  7.9 9.49 9.52 8.19 8.77 ...
model2 <- lm(log(x$Assets)~ Profits + Cash_Flow + Market_Value + sector , data = x1)
anova(model2)
## Analysis of Variance Table
## 
## Response: log(x$Assets)
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Profits       1 14.454 14.4537 17.4124 8.859e-05 ***
## Cash_Flow     1 18.730 18.7300 22.5641 1.113e-05 ***
## Market_Value  1  1.184  1.1839  1.4263  0.236584    
## sector        8 24.880  3.1100  3.7467  0.001121 ** 
## Residuals    67 55.615  0.8301                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# As seen in the summary of the model, profits, cash flow, and sector all have significant p-values, meaning that the variation in assets can be explained by something other than variation. In this case, it suggests that profits, cash flow, and sector can all help to explain this variation.

plot(fitted(model2),residuals(model2))

# When compared to the original mode, these residuals are significantly more scattered. Also, they appear to be more symmetric over the dynamic range. 

par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2))

# The q-q plot appears to be much more normally distributed. The data points are quite linear. 


# The next transformation was completed by taking the log of market value. 
x2 <- cbind(x1, log(x$Market_Value))
str(x2)
## 'data.frame':    79 obs. of  10 variables:
##  $ Company            : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Assets             : num  2687 13271 13621 3614 6425 ...
##  $ Sales              : num  1870 9115 4848 367 6131 ...
##  $ Market_Value       : num  1890 8190 4572 90 2448 ...
##  $ Profits            : num  145.7 -279 485 14.1 345.8 ...
##  $ Cash_Flow          : num  352.2 83 898.9 24.6 682.5 ...
##  $ Employees          : num  18.2 143.8 23.4 1.1 49.5 ...
##  $ sector             : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...
##  $ log(x$Assets)      : num  7.9 9.49 9.52 8.19 8.77 ...
##  $ log(x$Market_Value): num  7.54 9.01 8.43 4.5 7.8 ...
model3 <- lm(log(x$Assets)~ Profits + Cash_Flow + log(x$Market_Value) + sector , data = x1)
anova(model2)
## Analysis of Variance Table
## 
## Response: log(x$Assets)
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Profits       1 14.454 14.4537 17.4124 8.859e-05 ***
## Cash_Flow     1 18.730 18.7300 22.5641 1.113e-05 ***
## Market_Value  1  1.184  1.1839  1.4263  0.236584    
## sector        8 24.880  3.1100  3.7467  0.001121 ** 
## Residuals    67 55.615  0.8301                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Similarly to the second model, this model results in significant p-values for profits, cash flow, and sector, and insignificant p-values for market value. The significant p-values indicates that the variation in assets (now log of assets) can be explained by something other than randomization - in this case, profits, cash flow, and market sector. 

plot(fitted(model3),residuals(model3))

# This model appears to be less symmetric than the previous model. The fitted points vs. residuals would appear more symmetric across the dynamic range had it not been for the two points that seem to be outliers. 

par(mfrow = c(1,1))
qqnorm(residuals(model3))
qqline(residuals(model3))

# While this Q-Q plot is still better than the original, these data points are less linear than the previous model. Therefore, it appears that this model results in not as strong of parametric residuals. 

par(mfrow=c(1,3))
qqnorm(residuals(model1), main = "QQ of Original")
qqline(residuals(model1))
qqnorm(residuals(model2), main = "QQ of Log(Assets)")
qqline(residuals(model2))
qqnorm(residuals(model3), main = "QQ of Log(Assets) and Log(Market Value)")
qqline(residuals(model3))

# These graphs show a comparison of the Q-Q plots for the three models that have been evaluated. It appears that the model with a log transformation of assets is fairly parametric.