The Forbes 2000 Ranking of the World's Biggest Companies (Year 2004)
Forbes = read.csv("Forbes2000.csv", header=TRUE, sep = ",")
The Forbes dataset summarizes the Forbes ranking of the top 2000 companies, measured by sales, profits, assets, and market value. It consists of 2000 observations on 8 variables.
head(Forbes) # shows the first 6 observations/entries
str(Forbes) # gives the structure of the dataset, showing the factors and their levels
summary(Forbes) # summarizes descriptive statistics, giving an idea of the basic behavior of each variable
Using the levels function, or simply examining the structure of the Forbes dataset, we can see that the response variable (rank) takes integer values from 1 to 2000. The important categorical factors are country and category, which describe the country of origin/headquarters and the specific business segment each company serves. Numerical values for sales, profits, assets, and market value are also present, but each can take up to 2000 distinct values. To follow the instructions/requirements thoroughly, we will convert two of these into factors.
Forbes$profits <- as.factor(Forbes$profits) # converts profits to factor
Forbes$marketvalue <- as.factor(Forbes$marketvalue) # converts marketvalue to factor
str(Forbes) # examine new factors/levels of Forbes dataset
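Converting a continuous column with as.factor() creates one level per distinct value, so a column with close to 2000 distinct values becomes a factor with close to 2000 levels. The following is a minimal sketch of one possible alternative, not the approach used in this report: binning the values into quartile groups with cut() so that the factor has only four levels. It uses a small made-up vector rather than the Forbes data, and the names `profits_demo` and `profits_bin` are hypothetical.

```r
# Toy values standing in for a continuous column (not the Forbes data)
profits_demo <- c(-1.2, 0.05, 0.08, 0.2, 0.3, 0.44, 0.9, 2.5, 5.1, 20.9)

# as.factor() keeps every distinct value as its own level
as_is <- as.factor(profits_demo)
nlevels(as_is)

# cut() at the quartiles gives four ordered bins instead
breaks <- quantile(profits_demo, probs = seq(0, 1, 0.25))
profits_bin <- cut(profits_demo, breaks = breaks, include.lowest = TRUE,
                   labels = c("Q1", "Q2", "Q3", "Q4"))
table(profits_bin)
```

A factor with a handful of levels leaves many observations per level, which is what a one-way ANOVA on a factor needs in order to estimate within-group variance.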
The following 4 factors will be considered in relation to the response variable, rank: country, category, profits, and market value.
Besides Rank, Country, and Category, variables included in this dataset are continuous. Sales, Profits, Assets, and Market Value can take potentially any value.
The response variable considered in this case is Rank, an integer ranging from 1-2000.
An objective of this experiment is to generate a fitted model for company rank based on influential factors. We would like to determine which factors influence success of a company (measured mostly through metrics of size, although there are inarguably many more aspects to consider).
First, exploratory data analysis and descriptive statistics will be applied to the dataset, in an effort to gauge the significance of each factor and to identify main and interaction effects. This will use plots, ANOVA tests, and any other functions built into R that are understood at this point through our analysis of the literature thus far. The significance of each term will be examined to judge how well a fitted model explains rank, and whether the variance within or between the levels of each factor is large enough to undermine the model and render it inadequate.
The data have already been collected and were accessed through a search engine found on the 100 interesting datasets link. The dataset compiles financial data from publicly available records of several thousand companies.
To set up a hypothesis test for this experiment, let us consider 4 factors with varying levels that may have both main and interaction effects on the ranking of 2000 companies. The null hypothesis is that ranking is not affected by country, category, profits, or market value. This sounds fairly unlikely, so let's use the tools we have to test the null hypothesis.
The experimental design is unbiased in that it collects data from public records and is not susceptible to exogenous factors which may confound results. Data was retrieved online, already cleaned and organized, and factors were chosen at my discretion. Country and category were obvious selections, and two other continuous variables were converted to factors for exploration as well. There are likely hundreds of thousands of factors that exist and have measurable influence on company rankings, but of the 8 provided, 4 were selected. Little information is provided with the dataset as to why each factor was chosen, but the nature of the ranking system is something Forbes has been cultivating for years.
Randomization ideally balances out or negates the impact of nuisance factors or uncontrollable lurking variables. Blocking is used in this experiment, which is basically a restriction on randomization. Due to the nature of the collection of data, this experiment is not randomized.
In any experiment, replications are important to demonstrate that results are representative of a whole population. As n approaches infinity, the sample approaches the population. This usually improves confidence and enhances randomization. In our study, there are 2000 entries. However, this is not indicative of replication. To use replication, it would be helpful to consider data from more than just 2004. A collection for a 10-20 year period would show repeated measurements and increase the validity of the study.
Blocking is used to decrease the influence of nuisance factors. For this purpose, blocking is essentially used in that only 4 factors are considered to be predictive, so the others are treated as nuisance factors and are held constant or not considered through each experimental run. It is important to note that these so-called nuisance factors probably have some effect on the response. However, they are not considered of interest in this experiment, so they are avoided or blocked out. (1)
summary(Forbes)
## X rank name
## Min. : 1.0 Min. : 1.0 Aareal Bank : 1
## 1st Qu.: 500.8 1st Qu.: 500.8 ABB Group : 1
## Median :1000.5 Median :1000.5 Abbey National : 1
## Mean :1000.5 Mean :1000.5 Abbott Laboratories : 1
## 3rd Qu.:1500.2 3rd Qu.:1500.2 Abercrombie & Fitch : 1
## Max. :2000.0 Max. :2000.0 Abertis Infraestructuras: 1
## (Other) :1994
## country category sales
## United States :751 Banking : 313 Min. : 0.010
## Japan :316 Diversified financials: 158 1st Qu.: 2.018
## United Kingdom:137 Insurance : 112 Median : 4.365
## Germany : 65 Utilities : 110 Mean : 9.697
## France : 63 Materials : 97 3rd Qu.: 9.547
## Canada : 56 Oil & gas operations : 90 Max. :256.330
## (Other) :612 (Other) :1120
## profits assets marketvalue
## Min. :-25.8300 Min. : 0.270 Min. : 0.02
## 1st Qu.: 0.0800 1st Qu.: 4.025 1st Qu.: 2.72
## Median : 0.2000 Median : 9.345 Median : 5.15
## Mean : 0.3811 Mean : 34.042 Mean : 11.88
## 3rd Qu.: 0.4400 3rd Qu.: 22.793 3rd Qu.: 10.60
## Max. : 20.9600 Max. :1264.030 Max. :328.54
## NA's :5
country_int = as.integer(Forbes$country)
category_int = as.integer(Forbes$category)
profits_int = as.integer(Forbes$profits)
marketvalue_int = as.integer(Forbes$marketvalue)
hist(country_int)
By looking at the histogram, the frequency of each country can be interpreted (although the labels assigned are very poor). The US has the most companies, followed by Japan, the UK, Germany, France, Canada, and others. Country of origin is likely a highly predictive factor of company ranking.
hist(category_int)
Examining the category each company competes in (again very poorly labeled), Banking has the most entries, followed by diversified financials and insurance. Categories related to capital management make up the top entries, with utilities, materials, and oil/gas operations near the top as well. The histograms are interesting to look at, but do not provide a clear enough picture of the distribution of each variable.
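Part of the labeling problem is that hist() on as.integer() codes treats the level numbers as a continuous variable and bins them, so the level names never reach the axis. A minimal sketch of an alternative, assuming a small made-up factor in place of Forbes$country (the name `country_demo` is hypothetical): count the levels with table() and draw the counts with barplot(), which labels each bar with its level name.

```r
# Toy country labels standing in for Forbes$country (not the real data)
country_demo <- factor(c("United States", "Japan", "United States",
                         "Germany", "Japan", "United States"))

# Count companies per level and sort from most to least frequent
counts <- sort(table(country_demo), decreasing = TRUE)
counts

# One bar per level, labeled with the level name rather than an integer code
barplot(counts, ylab = "Number of companies")
```

Sorting the counts first puts the most frequent levels on the left, which makes the ordering described in the text visible at a glance.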
# vertical is not a boxplot() argument (boxes are vertical by default), so it is dropped
boxplot(rank ~ profits, data = Forbes, ylab = "Rank")
boxplot(rank ~ marketvalue, data = Forbes, ylab = "Rank")
boxplot(rank ~ country, data = Forbes, ylab = "Rank")
boxplot(rank ~ category, data = Forbes, ylab = "Rank")
It is difficult to get a sense of each region and industry from these plots. I would like to make the labels appear vertically so that each corresponds to a tick mark, allowing most or all of the plots to be accurately labeled. An interesting plot is that of profits, because the boxplots reflect a distribution which is quite logical. As expected, companies with greater profits sit closer to the top ranks (1-250), while those with lower profits and larger spreads sit lower in the rankings. Examining rank against country, the countries with more entries have a wider spread and are more susceptible to outliers.
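In base R graphics, the las graphical parameter controls the orientation of axis labels, and las = 2 draws them perpendicular to the axis. The sketch below shows this on made-up data rather than the Forbes data; the data frame `demo` and the margin sizes are illustrative assumptions.

```r
# Toy data standing in for rank by category (not the Forbes data)
set.seed(1)
demo <- data.frame(
  rank = sample(1:2000, 60),
  category = rep(c("Banking", "Insurance", "Utilities"), each = 20)
)

# Widen the bottom margin so rotated labels are not clipped
old_par <- par(mar = c(8, 4, 2, 1))

# las = 2 draws axis labels perpendicular to the axis (vertical on the x-axis)
bp <- boxplot(rank ~ category, data = demo, las = 2,
              xlab = "", ylab = "Rank", cex.axis = 0.8)

par(old_par)  # restore the previous margins
```

With many levels, as in the country and category plots, a smaller cex.axis and a larger bottom margin may be needed for every label to fit.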
Analysis of variance (ANOVA) will be used to test and measure variation within and between sample groups. (These are main effects.)
ANOVA1 = aov(Forbes$rank ~ Forbes$country)
ANOVA1
## Call:
## aov(formula = Forbes$rank ~ Forbes$country)
##
## Terms:
## Forbes$country Residuals
## Sum of Squares 42788048 623878452
## Deg. of Freedom 60 1939
##
## Residual standard error: 567.2325
## Estimated effects may be unbalanced
summary(ANOVA1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$country 60 42788048 713134 2.216 3.81e-07 ***
## Residuals 1939 623878452 321753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
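As a check on how the table above is built, the F value and its P value can be recomputed from the sums of squares and degrees of freedom that summary(ANOVA1) reports. This is only a sketch of the arithmetic; the variable names are hypothetical and the numbers are copied from the output above.

```r
# Values copied from the summary(ANOVA1) table above
ss_country <- 42788048;  df_country <- 60
ss_resid   <- 623878452; df_resid   <- 1939

# Mean squares are sums of squares divided by their degrees of freedom
ms_country <- ss_country / df_country
ms_resid   <- ss_resid / df_resid

# F is the ratio of the two mean squares
f_value <- ms_country / ms_resid
round(f_value, 3)

# The upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_country, df_resid, lower.tail = FALSE)
p_value
```

The rounded F value should agree with the 2.216 shown in the table.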
ANOVA2 = aov(Forbes$rank ~ Forbes$category)
ANOVA2
summary(ANOVA2)
ANOVA3 = aov(Forbes$rank ~ Forbes$profits)
ANOVA3
summary(ANOVA3)
ANOVA4 = aov(Forbes$rank ~ Forbes$marketvalue)
ANOVA4
summary(ANOVA4)
The results of the analysis of variance lead me to reject the null hypothesis. According to the table and the significance associated with the P value, we cannot say that variance in rank is not affected by these factors. They do indeed have an effect, as expected.
The following anova tables summarize results for interaction effects. They can also be represented by interaction plots.
ANOVA5 = aov(Forbes$rank ~ Forbes$country*Forbes$category)
ANOVA5
## Call:
## aov(formula = Forbes$rank ~ Forbes$country * Forbes$category)
##
## Terms:
## Forbes$country Forbes$category
## Sum of Squares 42788048 31827343
## Deg. of Freedom 60 26
## Forbes$country:Forbes$category Residuals
## Sum of Squares 150795868 441255241
## Deg. of Freedom 376 1537
##
## Residual standard error: 535.8065
## 1184 out of 1647 effects not estimable
## Estimated effects may be unbalanced
summary(ANOVA5)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$country 60 42788048 713134 2.484 5.56e-09 ***
## Forbes$category 26 31827343 1224129 4.264 6.08e-12 ***
## Forbes$country:Forbes$category 376 150795868 401053 1.397 1.02e-05 ***
## Residuals 1537 441255241 287089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA6 = aov(Forbes$rank ~ Forbes$country*Forbes$profits)
ANOVA6
## Call:
## aov(formula = Forbes$rank ~ Forbes$country * Forbes$profits)
##
## Terms:
## Forbes$country Forbes$profits
## Sum of Squares 42961784 55257043
## Deg. of Freedom 60 1
## Forbes$country:Forbes$profits Residuals
## Sum of Squares 70964592 496409145
## Deg. of Freedom 46 1887
##
## Residual standard error: 512.9015
## 14 out of 122 effects not estimable
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA6)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$country 60 42961784 716030 2.722 6.74e-11 ***
## Forbes$profits 1 55257043 55257043 210.049 < 2e-16 ***
## Forbes$country:Forbes$profits 46 70964592 1542709 5.864 < 2e-16 ***
## Residuals 1887 496409145 263068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
ANOVA7 = aov(Forbes$rank ~ Forbes$country*Forbes$marketvalue)
ANOVA7
## Call:
## aov(formula = Forbes$rank ~ Forbes$country * Forbes$marketvalue)
##
## Terms:
## Forbes$country Forbes$marketvalue
## Sum of Squares 42788048 141081290
## Deg. of Freedom 60 1
## Forbes$country:Forbes$marketvalue Residuals
## Sum of Squares 72367424 410429738
## Deg. of Freedom 46 1892
##
## Residual standard error: 465.7564
## 14 out of 122 effects not estimable
## Estimated effects may be unbalanced
summary(ANOVA7)
## Df Sum Sq Mean Sq F value
## Forbes$country 60 42788048 713134 3.287
## Forbes$marketvalue 1 141081290 141081290 650.357
## Forbes$country:Forbes$marketvalue 46 72367424 1573205 7.252
## Residuals 1892 410429738 216929
## Pr(>F)
## Forbes$country 1.71e-15 ***
## Forbes$marketvalue < 2e-16 ***
## Forbes$country:Forbes$marketvalue < 2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA8 = aov(Forbes$rank ~ Forbes$category*Forbes$profits)
ANOVA8
## Call:
## aov(formula = Forbes$rank ~ Forbes$category * Forbes$profits)
##
## Terms:
## Forbes$category Forbes$profits
## Sum of Squares 32946310 59956701
## Deg. of Freedom 26 1
## Forbes$category:Forbes$profits Residuals
## Sum of Squares 70396270 502293283
## Deg. of Freedom 26 1941
##
## Residual standard error: 508.7049
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA8)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$category 26 32946310 1267166 4.897 9.58e-15
## Forbes$profits 1 59956701 59956701 231.689 < 2e-16
## Forbes$category:Forbes$profits 26 70396270 2707549 10.463 < 2e-16
## Residuals 1941 502293283 258781
##
## Forbes$category ***
## Forbes$profits ***
## Forbes$category:Forbes$profits ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
ANOVA9 = aov(Forbes$rank ~ Forbes$category*Forbes$marketvalue)
ANOVA9
## Call:
## aov(formula = Forbes$rank ~ Forbes$category * Forbes$marketvalue)
##
## Terms:
## Forbes$category Forbes$marketvalue
## Sum of Squares 32971553 150721547
## Deg. of Freedom 26 1
## Forbes$category:Forbes$marketvalue Residuals
## Sum of Squares 89156625 393816775
## Deg. of Freedom 26 1946
##
## Residual standard error: 449.8582
## Estimated effects may be unbalanced
summary(ANOVA9)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$category 26 32971553 1268137 6.266 <2e-16
## Forbes$marketvalue 1 150721547 150721547 744.773 <2e-16
## Forbes$category:Forbes$marketvalue 26 89156625 3429101 16.945 <2e-16
## Residuals 1946 393816775 202372
##
## Forbes$category ***
## Forbes$marketvalue ***
## Forbes$category:Forbes$marketvalue ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA9 = aov(Forbes$rank ~ Forbes$profits*Forbes$marketvalue)
ANOVA9
## Call:
## aov(formula = Forbes$rank ~ Forbes$profits * Forbes$marketvalue)
##
## Terms:
## Forbes$profits Forbes$marketvalue
## Sum of Squares 61675431 99179668
## Deg. of Freedom 1 1
## Forbes$profits:Forbes$marketvalue Residuals
## Sum of Squares 66881795 437855670
## Deg. of Freedom 1 1991
##
## Residual standard error: 468.9536
## Estimated effects may be unbalanced
## 5 observations deleted due to missingness
summary(ANOVA9)
## Df Sum Sq Mean Sq F value Pr(>F)
## Forbes$profits 1 61675431 61675431 280.4 <2e-16
## Forbes$marketvalue 1 99179668 99179668 451.0 <2e-16
## Forbes$profits:Forbes$marketvalue 1 66881795 66881795 304.1 <2e-16
## Residuals 1991 437855670 219917
##
## Forbes$profits ***
## Forbes$marketvalue ***
## Forbes$profits:Forbes$marketvalue ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5 observations deleted due to missingness
interaction.plot(Forbes$category, Forbes$country, Forbes$rank)
interaction.plot(Forbes$profits, Forbes$country, Forbes$rank)
interaction.plot(Forbes$marketvalue, Forbes$country, Forbes$rank)
interaction.plot(Forbes$category, Forbes$profits, Forbes$rank)
interaction.plot(Forbes$category, Forbes$marketvalue, Forbes$rank)
It is difficult to determine the significance of the interaction effects between factors. Given the low P values, I suspect there are interaction effects. It is, however, extremely hard for me to grasp these effects, given my limited understanding of how to interpret the interaction plots.
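One way to read an interaction plot is to look at the cell means it is drawn from: interaction.plot() plots the mean response for each combination of the two factors and joins the means that share a trace level. The sketch below uses a made-up 2 x 2 layout, not the Forbes data, and the names `region` and `sector` are hypothetical.

```r
# Toy 2 x 2 layout (made-up numbers, not the Forbes data)
demo <- data.frame(
  region = factor(rep(c("A", "B"), each = 4)),
  sector = factor(rep(c("Bank", "Tech"), times = 4)),
  rank   = c(100, 400, 120, 380, 300, 310, 320, 290)
)

# Cell means: these are the points that interaction.plot() connects
cell_means <- with(demo, tapply(rank, list(sector, region), mean))
cell_means

# One line per region across sectors; non-parallel lines suggest an interaction
with(demo, interaction.plot(x.factor = sector, trace.factor = region,
                            response = rank, ylab = "Mean rank"))
```

In this toy case the line for region A rises steeply from Bank to Tech while the line for region B stays nearly flat, so the effect of sector depends on region. Plots with dozens of levels, or with a continuous variable on the x-axis, are much harder to read this way.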
Diagnostics / Model Adequacy Checking
Plotting residuals against fitted values is a good method for checking the adequacy of the model. A normal Q-Q plot of the residuals with a fitted reference line should be an adequate check of the normality assumption.
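A minimal sketch of these two checks is given below, assuming a made-up one-way layout rather than the Forbes models; the data frame `demo` and the object `fit` are hypothetical. The same calls would apply to any of the aov objects fitted earlier.

```r
# Toy data with three groups (made-up, not the Forbes data)
set.seed(42)
demo <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 30)),
  y     = c(rnorm(30, 10), rnorm(30, 12), rnorm(30, 15))
)

fit <- aov(y ~ group, data = demo)

# Residuals vs fitted: look for a patternless band of roughly constant width
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the reference line support normal residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
```

A funnel shape in the first plot would point to non-constant variance, and strong curvature in the second to non-normal residuals.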
http://www.stat.washington.edu/pds/stat502/LectureNotes/RCBD.pdf I used the definitions and principles of blocking, replication, and randomization discussed in this PDF, prepared for the University of Washington.
https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf I also used this resource for generating the R Markdown document. It is an extremely helpful sheet, and I would recommend it to anyone who has questions about R Markdown.
Unlike most experimental results, it is quite unlikely that these data are the result of chance. There are, however, many additional factors which should affect company rank. For the purpose of this study, I think the data include factors that are specifically used to determine rank, and therefore they should be extremely predictive.
Forbes = read.csv("Forbes2000.csv", header=TRUE, sep = ",")
#Load Forbes Dataset, containing 2000 entries & 8 variables Year 2004
head(Forbes) # shows the first 6 observations/entries
str(Forbes) # gives the structure of the dataset, showing the factors and their levels
summary(Forbes) # summarizes descriptive statistics, giving an idea of the basic behavior of each variable
country_int = as.integer(Forbes$country)
category_int = as.integer(Forbes$category)
profits_int = as.integer(Forbes$profits)
marketvalue_int = as.integer(Forbes$marketvalue)
hist(country_int)
hist(category_int)
# vertical is not a boxplot() argument (boxes are vertical by default), so it is dropped
boxplot(rank ~ profits, data = Forbes, ylab = "Rank")
boxplot(rank ~ marketvalue, data = Forbes, ylab = "Rank")
boxplot(rank ~ country, data = Forbes, ylab = "Rank")
boxplot(rank ~ category, data = Forbes, ylab = "Rank")
ANOVA1 = aov(Forbes$rank ~ Forbes$country)
ANOVA1
summary(ANOVA1)
ANOVA2 = aov(Forbes$rank ~ Forbes$category)
ANOVA2
summary(ANOVA2)
ANOVA3 = aov(Forbes$rank ~ Forbes$profits)
ANOVA3
summary(ANOVA3)
ANOVA4 = aov(Forbes$rank ~ Forbes$marketvalue)
ANOVA4
summary(ANOVA4)
ANOVA5 = aov(Forbes$rank ~ Forbes$country*Forbes$category)
ANOVA5
summary(ANOVA5)
ANOVA6 = aov(Forbes$rank ~ Forbes$country*Forbes$profits)
ANOVA6
summary(ANOVA6)
ANOVA7 = aov(Forbes$rank ~ Forbes$country*Forbes$marketvalue)
ANOVA7
summary(ANOVA7)
ANOVA8 = aov(Forbes$rank ~ Forbes$category*Forbes$profits)
ANOVA8
summary(ANOVA8)
ANOVA9 = aov(Forbes$rank ~ Forbes$category*Forbes$marketvalue)
ANOVA9
summary(ANOVA9)
ANOVA9 = aov(Forbes$rank ~ Forbes$profits*Forbes$marketvalue)
ANOVA9
summary(ANOVA9)
interaction.plot(Forbes$category, Forbes$country, Forbes$rank)
interaction.plot(Forbes$profits, Forbes$country, Forbes$rank)
interaction.plot(Forbes$marketvalue, Forbes$country, Forbes$rank)
interaction.plot(Forbes$category, Forbes$profits, Forbes$rank)
interaction.plot(Forbes$category, Forbes$marketvalue, Forbes$rank)