Design of Experiments: Project 1

Andreas Vought, Rensselaer Polytechnic Institute

Version 1, Rev. 10/11/2016

1. Setting

The Forbes 2000 Ranking of the World's Biggest Companies (Year 2004)

Load the Forbes dataset, containing 2000 entries and 8 variables for the year 2004.

Forbes = read.csv("Forbes2000.csv", header=TRUE, sep = ",")

System Under Test

The Forbes dataset was collected to summarize the Forbes ranking of the top 2000 companies, measured by sales, profits, assets, and market value. It consists of 2000 observations on 8 variables.

head(Forbes) # shows the first 6 observations/entries

Factors and Levels

str(Forbes) # gives the structure of the dataset, showing the factors and their levels

summary(Forbes) # summarizes descriptive statistics, giving an idea of the basic behavior of each variable

By using the levels function, or by simply examining the structure of the Forbes dataset, we can see that the response variable (Rank) takes integer values from 1 to 2000. The important categorical factors are country and category, which describe each company's country of origin/headquarters and the specific business segment it serves. Numerical values for sales, profits, assets, and market value are also present, but each could take up to 2000 distinct levels. To follow the instructions/requirements thoroughly, we will convert two of these into factors.

Forbes$profits = as.factor(Forbes$profits) # converts profits to a factor
Forbes$marketvalue = as.factor(Forbes$marketvalue) # converts marketvalue to a factor

str(Forbes) #examine new factors/levels of Forbes dataset

4 factors and their respective levels are listed below. These will be considered in relation to the response variable, Rank:

* Country
* Category
* Profits
* Marketvalue
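Calling as.factor on a continuous column creates one level per distinct value, which can leave very few observations per level. One possible alternative, sketched here on synthetic data rather than the Forbes file, is to bin the values into quartiles with cut. The data frame `toy`, the column `profits_bin`, and the labels Q1 to Q4 are assumptions made for illustration and are not part of the original analysis.

```r
# Synthetic stand-in for the Forbes data; the values are illustrative only.
set.seed(1)
toy <- data.frame(
  rank    = 1:200,
  profits = round(rnorm(200, mean = 0.4, sd = 1.5), 2)
)

# as.factor() on a continuous column creates one level per distinct value.
n_levels_raw <- nlevels(as.factor(toy$profits))

# Binning at the quartiles gives a factor with four interpretable levels.
breaks <- quantile(toy$profits, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
toy$profits_bin <- cut(toy$profits, breaks = breaks,
                       include.lowest = TRUE,
                       labels = c("Q1", "Q2", "Q3", "Q4"))

n_levels_raw             # number of levels without binning
table(toy$profits_bin)   # number of observations in each quartile bin
```

With binned levels, each group contains enough observations for boxplots and ANOVA comparisons to be readable.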

Continuous Variables

Besides Rank, Country, and Category, the variables included in this dataset are continuous: Sales, Profits, Assets, and Market Value can potentially take any value.

Response Variable(s)

The response variable considered in this case is Rank, an integer ranging from 1-2000.

2. (Experimental) Design

An objective of this experiment is to generate a fitted model for company rank based on influential factors. We would like to determine which factors influence success of a company (measured mostly through metrics of size, although there are inarguably many more aspects to consider).

How will the experiment be organized and conducted to test the hypothesis?

First, exploratory data analysis and descriptive statistics will be applied to the dataset. This is an effort to determine the significance of each factor, as well as to determine main and interaction effects. It will use plots, ANOVA tests, and any other functions built into R that we understand at this point from our reading of the literature thus far. The significance of each factor will be examined to judge how effective a fitted model is, and whether the variance within or between the levels of each factor is large enough to undermine the predictive power of the model and render it inadequate.

The data had already been collected and was accessed through a search engine found on the 100 interesting datasets link. It compiles financial data from the publicly available records of several thousand companies.

To set up a hypothesis test for this experiment, let us consider 4 factors with varying levels that may have both main and interaction effects on the ranking of 2000 companies. The null hypothesis is that ranking is not affected by country, category, profits, or market value. This sounds fairly unlikely, so let's use the tools we have to test the null hypothesis.

What is the rationale for this design?

The experimental design is unbiased in that it collects data from public records and is not susceptible to exogenous factors which may confound results. Data was retrieved online, already cleaned and organized, and factors were chosen at my discretion. Country and Category were obvious selections, and two other continuous variables were converted to factors for exploration as well. There are likely hundreds of thousands of factors that exist and have measurable influence on company rankings, but of the 8 provided, 4 were selected. Little information is provided with the dataset as to why each factor was chosen, but the nature of the ranking system is something Forbes has been cultivating for years.

Randomize: What is the Randomization Scheme?

Randomization ideally balances out or negates the impact of nuisance factors or uncontrollable lurking variables. Blocking is used in this experiment, which is basically a restriction on randomization. Due to the nature of the collection of data, this experiment is not randomized.

Replicate: Are there replicates and/or repeated measures?

In any experiment, replications are important to demonstrate that results are representative of a whole population. As n approaches infinity, the sample approaches the population. This usually improves confidence and enhances randomization. In our study, there are 2000 entries. However, this is not indicative of replication. To use replication, it would be helpful to consider data from more than just 2004. A collection for a 10-20 year period would show repeated measurements and increase the validity of the study.

Block: Did you use blocking in the design?

Blocking is used to decrease the influence of nuisance factors. For this purpose, blocking is essentially used in that only 4 factors are considered to be predictive, so the others are basically treated as nuisance factors and are held constant or not considered through each experimental run. It is important to note that the so-called nuisance factors probably have some effect on the response. However, they are not considered of interest in this experiment, so they are avoided or blocked out.(1)

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(Forbes)
##        X               rank                              name     
##  Min.   :   1.0   Min.   :   1.0   Aareal Bank             :   1  
##  1st Qu.: 500.8   1st Qu.: 500.8   ABB Group               :   1  
##  Median :1000.5   Median :1000.5   Abbey National          :   1  
##  Mean   :1000.5   Mean   :1000.5   Abbott Laboratories     :   1  
##  3rd Qu.:1500.2   3rd Qu.:1500.2   Abercrombie & Fitch     :   1  
##  Max.   :2000.0   Max.   :2000.0   Abertis Infraestructuras:   1  
##                                    (Other)                 :1994  
##            country                      category        sales        
##  United States :751   Banking               : 313   Min.   :  0.010  
##  Japan         :316   Diversified financials: 158   1st Qu.:  2.018  
##  United Kingdom:137   Insurance             : 112   Median :  4.365  
##  Germany       : 65   Utilities             : 110   Mean   :  9.697  
##  France        : 63   Materials             :  97   3rd Qu.:  9.547  
##  Canada        : 56   Oil & gas operations  :  90   Max.   :256.330  
##  (Other)       :612   (Other)               :1120                    
##     profits             assets          marketvalue    
##  Min.   :-25.8300   Min.   :   0.270   Min.   :  0.02  
##  1st Qu.:  0.0800   1st Qu.:   4.025   1st Qu.:  2.72  
##  Median :  0.2000   Median :   9.345   Median :  5.15  
##  Mean   :  0.3811   Mean   :  34.042   Mean   : 11.88  
##  3rd Qu.:  0.4400   3rd Qu.:  22.793   3rd Qu.: 10.60  
##  Max.   : 20.9600   Max.   :1264.030   Max.   :328.54  
##  NA's   :5
country_int = as.integer(Forbes$country)
category_int = as.integer(Forbes$category)
profits_int = as.integer(Forbes$profits)
marketvalue_int = as.integer(Forbes$marketvalue)
hist(country_int)

By looking at the histogram, the frequency of each country can be interpreted (although with very poor labels assigned). The US has the most companies, followed by Japan, the UK, Germany, France, Canada, and others. Country of origin is likely a highly predictive factor of company ranking.

hist(category_int)

Examining the category each company competes in, although again very poorly labeled, shows that Banking has the most entries, followed by diversified financials and insurance. Categories related to capital management make up the top entries, with utilities, materials, and oil/gas operations near the top as well. The histograms are interesting to look at, but do not provide a clear enough picture of the distribution of each variable.
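Because the histograms are drawn from integer codes, the bars lose their country and category names. A minimal sketch of one alternative is a bar chart of the frequency table, which keeps the level names on the axis. The vector `country` below is a small made-up sample standing in for Forbes$country, and the margin sizes are arbitrary choices.

```r
# Small made-up factor standing in for Forbes$country (illustrative only).
country <- factor(c("United States", "Japan", "United States", "Germany",
                    "Japan", "United States", "France"))

# Count entries per country and sort from most to least frequent.
counts <- sort(table(country), decreasing = TRUE)

# Widen the left margin so the country names fit, then draw horizontal bars.
old_par <- par(mar = c(4, 8, 2, 1))
barplot(counts, horiz = TRUE, las = 1, xlab = "Number of companies")
par(old_par)
```

The same pattern would apply to the category column, since table works on any factor.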

boxplot(rank ~ profits, data = Forbes, ylab = "Rank")

boxplot(rank ~ marketvalue, data = Forbes, ylab = "Rank")

boxplot(rank ~ country, data = Forbes, ylab = "Rank")

boxplot(rank ~ category, data = Forbes, ylab = "Rank")

It is difficult to get a sense of each region and industry. I would like to figure out how to make the labels appear vertically so that each corresponds to a tick mark, allowing most or all of the plots to be accurately labeled. An interesting plot is that of profits, because the boxplots actually reflect a distribution which is quite logical. As expected, companies with greater profits sit closer to the top ranks (1-250), while those with lower profits show larger spreads and sit lower in the rankings. Examining rank by country, the countries with more entries have a wider spread and are more susceptible to outliers.
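One way to get vertical tick labels in base R graphics is the las graphical parameter: las = 2 draws labels perpendicular to the axis. The sketch below uses a synthetic data frame `toy` with three made-up categories. The enlarged bottom margin is an assumption about how much room long labels need, not a tuned value.

```r
# Synthetic data: 30 ranks in each of three illustrative categories.
set.seed(2)
toy <- data.frame(
  rank     = sample(1:2000, 90),
  category = factor(rep(c("Banking", "Insurance", "Utilities"), each = 30))
)

# Enlarge the bottom margin so that rotated labels are not clipped.
old_par <- par(mar = c(8, 4, 2, 1))

# las = 2 draws tick labels perpendicular to the axis (vertical on the x-axis).
bp <- boxplot(rank ~ category, data = toy, las = 2,
              xlab = "", ylab = "Rank")

# Restore the previous margins.
par(old_par)
```

With many levels, as in the country factor, cex.axis can additionally shrink the label text.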

Testing

Analysis of variance (ANOVA) will be used to test and measure variation within and between sample groups. (These are main effects.)
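As a small illustration of what a one-way ANOVA reports, the sketch below fits aov to synthetic data with three groups and pulls the F statistic and P value out of the summary table. The data frame `toy`, the group means, and the 5% threshold are assumptions chosen for the example.

```r
# Synthetic data: three groups of 20, with group C shifted upward.
set.seed(3)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 13))
)

fit <- aov(y ~ group, data = toy)
tab <- summary(fit)[[1]]   # the ANOVA table as a data frame

# The F value is the between-group mean square over the residual mean square.
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

# Reject the null hypothesis of equal group means at the 5% level if p < 0.05.
reject_null <- p_value < 0.05

tab
```

The degrees of freedom follow the usual pattern: number of groups minus one for the factor, and number of observations minus number of groups for the residuals.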

ANOVA1 = aov(Forbes$rank ~ Forbes$country)
ANOVA1
## Call:
##    aov(formula = Forbes$rank ~ Forbes$country)
## 
## Terms:
##                 Forbes$country Residuals
## Sum of Squares        42788048 623878452
## Deg. of Freedom             60      1939
## 
## Residual standard error: 567.2325
## Estimated effects may be unbalanced
summary(ANOVA1)
##                  Df    Sum Sq Mean Sq F value   Pr(>F)    
## Forbes$country   60  42788048  713134   2.216 3.81e-07 ***
## Residuals      1939 623878452  321753                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA2 = aov(Forbes$rank ~ Forbes$category)
ANOVA2
summary(ANOVA2)

ANOVA3 = aov(Forbes$rank ~ Forbes$profits)
ANOVA3
summary(ANOVA3)

ANOVA4 = aov(Forbes$rank ~ Forbes$marketvalue)
ANOVA4
summary(ANOVA4)

The results of the analysis of variance lead me to reject the null hypothesis. According to the table and the significance associated with the P values, we cannot say that the variance in rank is not affected by these factors. They do indeed have an effect, as expected.

The following ANOVA tables summarize the results for interaction effects. These effects can also be represented by interaction plots.

ANOVA5 = aov(Forbes$rank ~ Forbes$country*Forbes$category)
ANOVA5
## Call:
##    aov(formula = Forbes$rank ~ Forbes$country * Forbes$category)
## 
## Terms:
##                 Forbes$country Forbes$category
## Sum of Squares        42788048        31827343
## Deg. of Freedom             60              26
##                 Forbes$country:Forbes$category Residuals
## Sum of Squares                       150795868 441255241
## Deg. of Freedom                            376      1537
## 
## Residual standard error: 535.8065
## 1184 out of 1647 effects not estimable
## Estimated effects may be unbalanced
summary(ANOVA5)
##                                  Df    Sum Sq Mean Sq F value   Pr(>F)    
## Forbes$country                   60  42788048  713134   2.484 5.56e-09 ***
## Forbes$category                  26  31827343 1224129   4.264 6.08e-12 ***
## Forbes$country:Forbes$category  376 150795868  401053   1.397 1.02e-05 ***
## Residuals                      1537 441255241  287089                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA6 = aov(Forbes$rank ~ Forbes$country*Forbes$profits)
ANOVA6
## Call:
##    aov(formula = Forbes$rank ~ Forbes$country * Forbes$profits)
## 
## Terms:
##                 Forbes$country Forbes$profits
## Sum of Squares        42961784       55257043
## Deg. of Freedom             60              1
##                 Forbes$country:Forbes$profits Residuals
## Sum of Squares                       70964592 496409145
## Deg. of Freedom                            46      1887
## 
## Residual standard error: 512.9015
## 14 out of 122 effects not estimable
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA6)
##                                 Df    Sum Sq  Mean Sq F value   Pr(>F)    
## Forbes$country                  60  42961784   716030   2.722 6.74e-11 ***
## Forbes$profits                   1  55257043 55257043 210.049  < 2e-16 ***
## Forbes$country:Forbes$profits   46  70964592  1542709   5.864  < 2e-16 ***
## Residuals                     1887 496409145   263068                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
ANOVA7 = aov(Forbes$rank ~ Forbes$country*Forbes$marketvalue)
ANOVA7
## Call:
##    aov(formula = Forbes$rank ~ Forbes$country * Forbes$marketvalue)
## 
## Terms:
##                 Forbes$country Forbes$marketvalue
## Sum of Squares        42788048          141081290
## Deg. of Freedom             60                  1
##                 Forbes$country:Forbes$marketvalue Residuals
## Sum of Squares                           72367424 410429738
## Deg. of Freedom                                46      1892
## 
## Residual standard error: 465.7564
## 14 out of 122 effects not estimable
## Estimated effects may be unbalanced
summary(ANOVA7)
##                                     Df    Sum Sq   Mean Sq F value
## Forbes$country                      60  42788048    713134   3.287
## Forbes$marketvalue                   1 141081290 141081290 650.357
## Forbes$country:Forbes$marketvalue   46  72367424   1573205   7.252
## Residuals                         1892 410429738    216929        
##                                     Pr(>F)    
## Forbes$country                    1.71e-15 ***
## Forbes$marketvalue                 < 2e-16 ***
## Forbes$country:Forbes$marketvalue  < 2e-16 ***
## Residuals                                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA8 = aov(Forbes$rank ~ Forbes$category*Forbes$profits)
ANOVA8
## Call:
##    aov(formula = Forbes$rank ~ Forbes$category * Forbes$profits)
## 
## Terms:
##                 Forbes$category Forbes$profits
## Sum of Squares         32946310       59956701
## Deg. of Freedom              26              1
##                 Forbes$category:Forbes$profits Residuals
## Sum of Squares                        70396270 502293283
## Deg. of Freedom                             26      1941
## 
## Residual standard error: 508.7049
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA8)
##                                  Df    Sum Sq  Mean Sq F value   Pr(>F)
## Forbes$category                  26  32946310  1267166   4.897 9.58e-15
## Forbes$profits                    1  59956701 59956701 231.689  < 2e-16
## Forbes$category:Forbes$profits   26  70396270  2707549  10.463  < 2e-16
## Residuals                      1941 502293283   258781                 
##                                   
## Forbes$category                ***
## Forbes$profits                 ***
## Forbes$category:Forbes$profits ***
## Residuals                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
ANOVA9 = aov(Forbes$rank ~ Forbes$category*Forbes$marketvalue)
ANOVA9
## Call:
##    aov(formula = Forbes$rank ~ Forbes$category * Forbes$marketvalue)
## 
## Terms:
##                 Forbes$category Forbes$marketvalue
## Sum of Squares         32971553          150721547
## Deg. of Freedom              26                  1
##                 Forbes$category:Forbes$marketvalue Residuals
## Sum of Squares                            89156625 393816775
## Deg. of Freedom                                 26      1946
## 
## Residual standard error: 449.8582
## Estimated effects may be unbalanced
summary(ANOVA9)
##                                      Df    Sum Sq   Mean Sq F value Pr(>F)
## Forbes$category                      26  32971553   1268137   6.266 <2e-16
## Forbes$marketvalue                    1 150721547 150721547 744.773 <2e-16
## Forbes$category:Forbes$marketvalue   26  89156625   3429101  16.945 <2e-16
## Residuals                          1946 393816775    202372               
##                                       
## Forbes$category                    ***
## Forbes$marketvalue                 ***
## Forbes$category:Forbes$marketvalue ***
## Residuals                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA10 = aov(Forbes$rank ~ Forbes$profits*Forbes$marketvalue)
ANOVA10
## Call:
##    aov(formula = Forbes$rank ~ Forbes$profits * Forbes$marketvalue)
## 
## Terms:
##                 Forbes$profits Forbes$marketvalue
## Sum of Squares        61675431           99179668
## Deg. of Freedom              1                  1
##                 Forbes$profits:Forbes$marketvalue Residuals
## Sum of Squares                           66881795 437855670
## Deg. of Freedom                                 1      1991
## 
## Residual standard error: 468.9536
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA10)
##                                     Df    Sum Sq  Mean Sq F value Pr(>F)
## Forbes$profits                       1  61675431 61675431   280.4 <2e-16
## Forbes$marketvalue                   1  99179668 99179668   451.0 <2e-16
## Forbes$profits:Forbes$marketvalue    1  66881795 66881795   304.1 <2e-16
## Residuals                         1991 437855670   219917               
##                                      
## Forbes$profits                    ***
## Forbes$marketvalue                ***
## Forbes$profits:Forbes$marketvalue ***
## Residuals                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
interaction.plot(Forbes$category, Forbes$country, Forbes$rank)

interaction.plot(Forbes$profits, Forbes$country, Forbes$rank)

interaction.plot(Forbes$marketvalue, Forbes$country, Forbes$rank)

interaction.plot(Forbes$category, Forbes$profits, Forbes$rank)

interaction.plot(Forbes$category, Forbes$marketvalue, Forbes$rank)

It is difficult to determine the significance of the interaction effects between factors. Given the low P values, I suspect there is an interaction effect. It is, however, extremely hard for me to grasp the effect, given my limited understanding of how to interpret the interaction plots.
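Interaction plots are easier to read when both factors have only a few levels. The sketch below builds synthetic data in which the effect of one factor depends on the other, then compares the cell means and draws the plot. The factors `size` and `region` and the effect sizes are invented for illustration and do not come from the Forbes data.

```r
# Synthetic 2 x 2 layout with 25 observations in each cell.
set.seed(4)
toy <- expand.grid(size = c("small", "large"),
                   region = c("east", "west"),
                   rep = 1:25)

# The effect of size is larger in the west: this is the built-in interaction.
toy$y <- 10 +
  2 * (toy$size == "large") +
  3 * (toy$size == "large" & toy$region == "west") +
  rnorm(nrow(toy))

# Cell means: one value for each size-by-region combination.
cell_means <- tapply(toy$y, list(toy$size, toy$region), mean)

# Difference between the size effect in the west and in the east.
diff_in_diff <- (cell_means["large", "west"] - cell_means["small", "west"]) -
  (cell_means["large", "east"] - cell_means["small", "east"])

# Non-parallel lines suggest that the two factors interact.
interaction.plot(x.factor = toy$size, trace.factor = toy$region,
                 response = toy$y, xlab = "size", ylab = "mean of y",
                 trace.label = "region")
```

Roughly parallel lines would indicate little or no interaction, while lines that diverge or cross indicate that the effect of one factor changes with the level of the other.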

Estimation

Skip

Diagnostics / Model Adequacy Checking

Skip

Plotting residuals against fitted values is a good method for checking the adequacy of the model. A normal Q-Q plot of the residuals with a reference line should also be an adequate check.
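A minimal sketch of these two diagnostic plots, fitted to synthetic data rather than the Forbes model, might look like the following. The data frame `toy` and its group means are assumptions made for the example.

```r
# Synthetic one-way layout with three groups of 20 observations.
set.seed(5)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 14))
)
fit <- aov(y ~ group, data = toy)

res      <- residuals(fit)   # observed minus fitted values
fit_vals <- fitted(fit)      # the group means in a one-way model

# Residuals vs fitted: look for a constant spread around zero.
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line support the normality assumption.
qqnorm(res)
qqline(res)
```

A funnel shape in the first plot or strong curvature in the second would suggest that the ANOVA assumptions are questionable for the fitted model.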

4. References

http://www.stat.washington.edu/pds/stat502/LectureNotes/RCBD.pdf I used the definitions and principles of blocking, replication, and randomization discussed in this PDF prepared for the University of Washington.

https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf I also used this resource for generating the R Markdown document. It is an extremely helpful sheet, and I would recommend it to anyone who has questions about R Markdown.

Contingencies to the experimental design and/or analysis, and summary of relevant theory (6020 requirements)

Unlike most experimental results, it is quite unlikely that these data are the result of chance. There are, however, many additional factors which should affect company rank. For the purpose of this study, I think the data include factors that are specifically used to dictate rank, and therefore they should be extremely predictive. These data do not follow all the principles of factorial experimental design, namely randomization and replication, but these are not necessarily needed to show that this model is a good fit. The model is inherently good because the ranking uses a combination of the factors analyzed to generate the list. It is difficult to randomize the data collection process because the responses are not really open to interpretation; they are based on reported statistics. Blocking is used to narrow the focus of the analysis. Typically, it is used to group experimental runs in ways that reduce the variability caused by nuisance factors.

5. Appendices (R Code Lines)

Forbes = read.csv("Forbes2000.csv", header=TRUE, sep = ",")
#Load Forbes Dataset, containing 2000 entries & 8 variables Year 2004

head(Forbes) # shows the first 6 observations/entries
str(Forbes) # gives the structure of the dataset, showing the factors and their levels

summary(Forbes) # summarizes descriptive statistics, giving an idea of the basic behavior of each variable

country_int = as.integer(Forbes$country)
category_int = as.integer(Forbes$category)
profits_int = as.integer(Forbes$profits)
marketvalue_int = as.integer(Forbes$marketvalue)

hist(country_int)
hist(category_int)

boxplot(rank ~ profits, data = Forbes, ylab = "Rank")
boxplot(rank ~ marketvalue, data = Forbes, ylab = "Rank")
boxplot(rank ~ country, data = Forbes, ylab = "Rank")
boxplot(rank ~ category, data = Forbes, ylab = "Rank")

ANOVA1 = aov(Forbes$rank ~ Forbes$country)
ANOVA1
summary(ANOVA1)

ANOVA2 = aov(Forbes$rank ~ Forbes$category)
ANOVA2
summary(ANOVA2)

ANOVA3 = aov(Forbes$rank ~ Forbes$profits)
ANOVA3
summary(ANOVA3)

ANOVA4 = aov(Forbes$rank ~ Forbes$marketvalue)
ANOVA4
summary(ANOVA4)

ANOVA5 = aov(Forbes$rank ~ Forbes$country*Forbes$category)
ANOVA5
summary(ANOVA5)

ANOVA6 = aov(Forbes$rank ~ Forbes$country*Forbes$profits)
ANOVA6
summary(ANOVA6)

ANOVA7 = aov(Forbes$rank ~ Forbes$country*Forbes$marketvalue)
ANOVA7
summary(ANOVA7)

ANOVA8 = aov(Forbes$rank ~ Forbes$category*Forbes$profits)
ANOVA8
summary(ANOVA8)

ANOVA9 = aov(Forbes$rank ~ Forbes$category*Forbes$marketvalue)
ANOVA9
summary(ANOVA9)

ANOVA10 = aov(Forbes$rank ~ Forbes$profits*Forbes$marketvalue)
ANOVA10
summary(ANOVA10)

interaction.plot(Forbes$category, Forbes$country, Forbes$rank)
interaction.plot(Forbes$profits, Forbes$country, Forbes$rank)
interaction.plot(Forbes$marketvalue, Forbes$country, Forbes$rank)
interaction.plot(Forbes$category, Forbes$profits, Forbes$rank)
interaction.plot(Forbes$category, Forbes$marketvalue, Forbes$rank)