The data used in this recipe was found using the “100+ interesting data sets” webpage.
This dataset has information from 79 companies which were selected as part of the Forbes 500 list in 1986. The comprehensive list of Forbes 500 includes more than 800 companies, because Forbes selects the top 500 in any of its selection criteria.
# Read in dataset from URL
Forbes500 <- read.delim("C:\\Users\\braunj6\\AppData\\Local\\Temp\\RtmpaC9pEu\\data1e2419cc593a")
#Rename for simplicity
x <- Forbes500
# Recategorize data as numeric from integers
x$Assets <- as.numeric(x$Assets)
x$Sales <- as.numeric(x$Sale)
x$Market_Value <- as.numeric(x$Market_Value)
The data being examined has 79 observations of 8 variables.
head(x)
## Company Assets Sales Market_Value Profits Cash_Flow
## 1 Air Products 2687 1870 1890 145.7 352.2
## 2 Allied Signal 13271 9115 8190 -279.0 83.0
## 3 American Electric Power 13621 4848 4572 485.0 898.9
## 4 American Savings Bank FSB 3614 367 90 14.1 24.6
## 5 AMR 6425 6131 2448 345.8 682.5
## 6 Apple Computer 1022 1754 1370 72.0 119.5
## Employees sector
## 1 18.2 Other
## 2 143.8 Other
## 3 23.4 Energy
## 4 1.1 Finance
## 5 49.5 Transportation
## 6 4.8 HiTech
summary(x)
## Company Assets Sales
## AH Robins : 1 Min. : 223 Min. : 176.0
## Air Products : 1 1st Qu.: 1122 1st Qu.: 815.5
## Allied Signal : 1 Median : 2788 Median : 1754.0
## American Electric Power : 1 Mean : 5941 Mean : 4178.3
## American Savings Bank FSB: 1 3rd Qu.: 5802 3rd Qu.: 4563.5
## AMR : 1 Max. :52634 Max. :50056.0
## (Other) :73
## Market_Value Profits Cash_Flow Employees
## Min. : 53.0 Min. :-771.5 Min. :-651.90 Min. : 0.60
## 1st Qu.: 512.5 1st Qu.: 39.0 1st Qu.: 75.15 1st Qu.: 3.95
## Median : 944.0 Median : 70.5 Median : 133.30 Median : 15.40
## Mean : 3269.8 Mean : 209.8 Mean : 400.93 Mean : 37.60
## 3rd Qu.: 1961.5 3rd Qu.: 188.1 3rd Qu.: 328.85 3rd Qu.: 48.50
## Max. :95697.0 Max. :6555.0 Max. :9874.00 Max. :400.20
##
## sector
## Finance :17
## Energy :15
## Manufacturing:10
## Retail :10
## HiTech : 8
## Other : 7
## (Other) :12
# As seen below, there are 79 observations of 8 variables, including: 2 factors and 5 numeric variables.
str(x)
## 'data.frame': 79 obs. of 8 variables:
## $ Company : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ Assets : num 2687 13271 13621 3614 6425 ...
## $ Sales : num 1870 9115 4848 367 6131 ...
## $ Market_Value: num 1890 8190 4572 90 2448 ...
## $ Profits : num 145.7 -279 485 14.1 345.8 ...
## $ Cash_Flow : num 352.2 83 898.9 24.6 682.5 ...
## $ Employees : num 18.2 143.8 23.4 1.1 49.5 ...
## $ sector : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...
The continuous variables in this dataset are assets, sales, market value, profits and cash flow, with each of those variables being in millions of dollars.
The response variable being analyzed in this dataset is the amount of assets each company has.
summary(x$Assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 223 1122 2788 5941 5802 52630
The data is organized into 8 columns. There are 2 factors and 6 variables.
There was no randomization in this recipe because of the data points are based on observations about the Forbes 500 companies.
The experiment will use an analysis of covariance (ANCOVA). It will use a single factor - market sector - which has 9 levels. It will also use multiple independent explanatory variables, such as sales, market value, profits, and cashflows, in order to predict the response variable - assets.
Analysis of Covarariance is a blend of ANOVA and regression, and is used to analyze the effect of a categorical variable on a contiuous dependent variable. An ANCOVA controls for the effects of other continuous variables, which vary along with the dependent variable. These continuous control variables are the covariates. In this design, the categorial factor is market sector, the continous dependent (response) variable in assets, and the covariates are ales, market value, profits, and cashflows. An ANCOVA is used because we want to see the effect that market sector has on amount of assets, but we need to control for the other continuous variables, which cannotet be randomized, but can be measured.
The experiment was based on observations, and was therefore not randomized.
The experiment had no repeated measures.
There was no blocking in this recipe.
attach(x)
The plot below compares assets, sales, and market value as dependent and independent variables. The graphs horizonal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the top-middle placement shows a generally positive correlation between sales and assets, with assets as the response (independent) variable. Because “assets” is the response variable of interest, the top row of graphs is of most interest.
plot(x[,c(2,3,4)], main = "Cross Comparison of Assets, Sales, and Market Value (in $M)")
As before, the plot below compares three different continuous variables, but this time, assets, sales, and market value were evaluated as dependent and independent variables. The graphs horizonal to a given variable use that variable as an independent variable, and the graphs vertical to the given variable uses that variable as a dependent variable. For example, the graph in the top-middle placement shows a generally positive correlation between sales and assets, with assets as the response (independent) variable.
plot(x[,c(3,4,5)], main = "Cross Comparison of Sales, Market Value, and Profits (in $M)")
The plot below shows the relationship between sales and assets. As stated before, it appears that there is a generally positive correlation between sales and assets, with assets as the response variable.
plot(Assets~Sales, main = "Assets by Sales", xlab = "Sales (in million dollars)", ylab = "Assets (in million dollars)")
The relationship between sales and profits seems to have neither a positive nor negative correlation. With the exception of two outliers, the relationship seems relatively horizontal.
plot(Profits~Sales, main = "Profits by Sales", xlab = "Sales (in million dollars)", ylab = "Profits (in million dollars)")
Similarly to the last graph, assets does not seem to predict profits. As assets increase, there are both increased and decreased profits.
plot(Profits~Assets, main = "Profits by Assets", xlab = "Assets (in million dollars)", ylab = "Profits (in million dollars)")
The boxplot below shows the relationship between the various sectors and their associated profits. Hi-tech and communication both have the largest ranges, and slightly greater medians than the other sectors.
boxplot(Profits~sector, main = "Profits by Sector", xlab = "Market Sector", ylab = "Profits (in million dollars)")
The boxplot below shows the relationship between the various market sectors and their associated assets. The communication sector has the greatest median assets. The hi-tech sector has the greatest range in assets. Medical seems to have the least median assets
boxplot(Assets~sector, main = "Assets by Sector", xlab = "Market Sector", ylab = "Assets (in million dollars)")
detach(x)
The focus of this recipe was on a single-factor, multiple dependent variable analysis of covariance.
#ANCOVA Testing
#Comparing the assets of various market sectors, given the profits, cash flows, and market values of those groups.
# H0: There is no difference between the mean assets of the various sectors. The variation in assets is not caused by anything other than randomization.
# Ha: The difference in variation of assets can caused by something other than randomization.
model1 <- lm(Assets~Profits + Cash_Flow + Market_Value + sector , data = x)
anova(model1)
## Analysis of Variance Table
##
## Response: Assets
## Df Sum Sq Mean Sq F value Pr(>F)
## Profits 1 2367764664 2367764664 80.3273 4.466e-13 ***
## Cash_Flow 1 618926766 618926766 20.9973 2.061e-05 ***
## Market_Value 1 632445688 632445688 21.4560 1.719e-05 ***
## sector 8 945981531 118247691 4.0116 0.0006167 ***
## Residuals 67 1974923534 29476471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary of the ANCOVA gives the p-values for each factor and variable. The null hypothesis states that the variation in the response variable, assets (in millions of dollars), cannot be explained by anything other than randomization. The p-values in this ANCOVA support that this is not the case. The p-values for profits, cash flow, market value, and sector are less than an alpha of 0.05. Therefore, it is likely that the variation in assets can be explained by these factors.
Tukey’s range tests are used alongside ANCOVAs in order to find means that are significantly different from each other, by compairing pairs of means. This test compares the mean of each treatment level to the mean of the other treatment levels.
The null hypothesis for a Tukey test states that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means. In this Tukey test, the difference between the means of pairs of sectors were compared.
tukey1 <- TukeyHSD(aov(Assets ~ sector, data = x))
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Assets ~ sector, data = x)
##
## $sector
## diff lwr upr p adj
## Energy-Communication -8347.0000 -29525.014 12831.014 0.9389272
## Finance-Communication -2100.7647 -23131.707 18930.178 0.9999964
## HiTech-Communication -228.6250 -22469.988 22012.738 1.0000000
## Manufacturing-Communication -8123.8000 -29915.796 13668.196 0.9553649
## Medical-Communication -10557.0000 -34921.192 13807.192 0.8990303
## Other-Communication -9336.5714 -31893.430 13220.288 0.9204048
## Retail-Communication -10147.1000 -31939.096 11644.896 0.8562152
## Transportation-Communication -9951.1667 -32921.948 13019.614 0.8991371
## Finance-Energy 6246.2353 -3719.889 16212.359 0.5444965
## HiTech-Energy 8118.3750 -4198.334 20435.084 0.4752950
## Manufacturing-Energy 223.2000 -11262.190 11708.590 1.0000000
## Medical-Energy -2210.0000 -18041.518 13621.518 0.9999517
## Other-Energy -989.5714 -13867.265 11888.122 0.9999995
## Retail-Energy -1800.1000 -13285.490 9685.290 0.9998836
## Transportation-Energy -1604.1667 -15193.864 11985.531 0.9999867
## HiTech-Finance 1872.1397 -10189.933 13934.212 0.9998919
## Manufacturing-Finance -6023.0353 -17234.925 5188.854 0.7326352
## Medical-Finance -8456.2353 -24090.468 7177.998 0.7254055
## Other-Finance -7235.8067 -19870.175 5398.561 0.6603990
## Retail-Finance -8046.3353 -19258.225 3165.554 0.3580328
## Transportation-Finance -7850.4020 -21209.749 5508.945 0.6287802
## Manufacturing-HiTech -7895.1750 -21239.993 5449.643 0.6202368
## Medical-HiTech -10328.3750 -27556.461 6899.711 0.6033144
## Other-HiTech -9107.9464 -23668.336 5452.443 0.5471248
## Retail-HiTech -9918.4750 -23263.293 3426.343 0.3116804
## Transportation-HiTech -9722.5417 -24916.285 5471.202 0.5161836
## Medical-Manufacturing -2433.2000 -19077.112 14210.712 0.9999313
## Other-Manufacturing -1212.7714 -15077.036 12651.493 0.9999987
## Retail-Manufacturing -2023.3000 -14604.915 10558.315 0.9998587
## Transportation-Manufacturing -1827.3667 -16355.364 12700.631 0.9999783
## Other-Medical 1220.4286 -16413.079 18853.936 0.9999998
## Retail-Medical 409.9000 -16234.012 17053.812 1.0000000
## Transportation-Medical 605.8333 -17554.163 18765.830 1.0000000
## Retail-Other -810.5286 -14674.793 13053.736 0.9999999
## Transportation-Other -614.5952 -16266.544 15037.354 1.0000000
## Transportation-Retail 195.9333 -14332.064 14723.931 1.0000000
plot(tukey1)
The results of the Tukey Test display all p-values greater than an alpha, 0.05 between all pairs of sectors. This leads us to fail to reject the null. Therefore, there is some probability that there is no significant difference between the mean assets of the pairs of sectors. The plot of the Tukey also supports this claim. The 95% confidence intervals of the pair’s difference still include 0 as a difference in mean levels. Therefore, “no difference” can not be not included from the possibilities.
Assumptions for ANCOVA tests include:
* the model must include at least one categorical and at least one continuous independ variable - in our model, we have one categorical, and three independent continuous variables.
* The dependent variable must in continuous - “Assets” is a continuous variable
* Normality of the response variable and residuals - normality will be evaluated below:
par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))
The Q-Q Normality Plots of the residuals for this model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.
plot(fitted(model1),residuals(model1))
The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the model do not appear to be symmetric over the length of the dynamic range, and are not symmetric over 0. There is a large cluster on the lower part of the range.
shapiro.test(x$Assets)
##
## Shapiro-Wilk normality test
##
## data: x$Assets
## W = 0.5848, p-value = 1.314e-13
The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the assets data is not normally distributed.
Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.
A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.
kruskal.test(Assets ~ sector, data = x)
##
## Kruskal-Wallis rank sum test
##
## data: Assets by sector
## Kruskal-Wallis chi-squared = 16.0199, df = 8, p-value = 0.0421
Unfortunately, with the kruskal-wallis test, we could only analyze how the various sector groups’ mean varied. The p-value of this Kruskal-Wallis test is just barely less than an alpha of 0.05. Therefore, you must reject the null - that variation in assets cannot be explained by anything other than randomization. Therefore, variation can be explained, in part, market sector.
Especially in financial data, data transformation can be used to create a better fit for analysis.
# The first transformation performed was done by taking the log of the response variable - assets.
x1 <- cbind(x, log(x$Assets)) # add transformed variable as a column
str(x1) # structure of new dataframe with transformed response variable.
## 'data.frame': 79 obs. of 9 variables:
## $ Company : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ Assets : num 2687 13271 13621 3614 6425 ...
## $ Sales : num 1870 9115 4848 367 6131 ...
## $ Market_Value : num 1890 8190 4572 90 2448 ...
## $ Profits : num 145.7 -279 485 14.1 345.8 ...
## $ Cash_Flow : num 352.2 83 898.9 24.6 682.5 ...
## $ Employees : num 18.2 143.8 23.4 1.1 49.5 ...
## $ sector : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...
## $ log(x$Assets): num 7.9 9.49 9.52 8.19 8.77 ...
model2 <- lm(log(x$Assets)~ Profits + Cash_Flow + Market_Value + sector , data = x1)
anova(model2)
## Analysis of Variance Table
##
## Response: log(x$Assets)
## Df Sum Sq Mean Sq F value Pr(>F)
## Profits 1 14.454 14.4537 17.4124 8.859e-05 ***
## Cash_Flow 1 18.730 18.7300 22.5641 1.113e-05 ***
## Market_Value 1 1.184 1.1839 1.4263 0.236584
## sector 8 24.880 3.1100 3.7467 0.001121 **
## Residuals 67 55.615 0.8301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# As seen in the summary of the model, profits, cash flow, and sector all have significant p-values, meaning that the variation in assets can be explained by something other than variation. In this case, it suggests that profits, cash flow, and sector can all help to explain this variation.
plot(fitted(model2),residuals(model2))
# When compared to the original mode, these residuals are significantly more scattered. Also, they appear to be more symmetric over the dynamic range.
par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2))
# The q-q plot appears to be much more normally distributed. The data points are quite linear.
# The next transformation was completed by taking the log of market value.
x2 <- cbind(x1, log(x$Market_Value))
str(x2)
## 'data.frame': 79 obs. of 10 variables:
## $ Company : Factor w/ 79 levels "AH Robins","Air Products",..: 2 3 4 5 6 7 8 9 10 11 ...
## $ Assets : num 2687 13271 13621 3614 6425 ...
## $ Sales : num 1870 9115 4848 367 6131 ...
## $ Market_Value : num 1890 8190 4572 90 2448 ...
## $ Profits : num 145.7 -279 485 14.1 345.8 ...
## $ Cash_Flow : num 352.2 83 898.9 24.6 682.5 ...
## $ Employees : num 18.2 143.8 23.4 1.1 49.5 ...
## $ sector : Factor w/ 9 levels "Communication",..: 7 7 2 3 9 4 5 7 3 1 ...
## $ log(x$Assets) : num 7.9 9.49 9.52 8.19 8.77 ...
## $ log(x$Market_Value): num 7.54 9.01 8.43 4.5 7.8 ...
model3 <- lm(log(x$Assets)~ Profits + Cash_Flow + log(x$Market_Value) + sector , data = x1)
anova(model2)
## Analysis of Variance Table
##
## Response: log(x$Assets)
## Df Sum Sq Mean Sq F value Pr(>F)
## Profits 1 14.454 14.4537 17.4124 8.859e-05 ***
## Cash_Flow 1 18.730 18.7300 22.5641 1.113e-05 ***
## Market_Value 1 1.184 1.1839 1.4263 0.236584
## sector 8 24.880 3.1100 3.7467 0.001121 **
## Residuals 67 55.615 0.8301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Similarly to the second model, this model results in significant p-values for profits, cash flow, and sector, and insignificant p-values for market value. The significant p-values indicates that the variation in assets (now log of assets) can be explained by something other than randomization - in this case, profits, cash flow, and market sector.
plot(fitted(model3),residuals(model3))
# This model appears to be less symmetric than the previous model. The fitted points vs. residuals would appear more symmetric across the dynamic range had it not been for the two points that seem to be outliers.
par(mfrow = c(1,1))
qqnorm(residuals(model3))
qqline(residuals(model3))
# While this Q-Q plot is still better than the original, these data points are less linear than the previous model. Therefore, it appears that this model results in not as strong of parametric residuals.
par(mfrow=c(1,3))
qqnorm(residuals(model1), main = "QQ of Original")
qqline(residuals(model1))
qqnorm(residuals(model2), main = "QQ of Log(Assets)")
qqline(residuals(model2))
qqnorm(residuals(model3), main = "QQ of Log(Assets) and Log(Market Value)")
qqline(residuals(model3))
# These graphs show a comparison of the Q-Q plots for the three models that have been evaluated. It appears that the model with a log transformation of assets is fairly parametric.