Final Project

Introduction

This report analyzes the Forbes 2000 company ranking, comparing companies across countries and examining the relationships between the recorded financial variables.

The goal of this report is to:

Practice the use of all previously learned concepts
Explore the relationships of profit to other variables
Compare the United States and Japan
Build a multiple linear regression model
Build and compare models for Japanese and American companies.

Add the libraries and read in the data

Here we load the ggplot2 and dplyr libraries and read in the Forbes2000 data set for the analysis.

library(ggplot2)
library(dplyr)
forbes <- read.csv("Forbes2000.csv")
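The model summaries further down report observations dropped for missingness, so it can help to count missing values right after loading. This is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical stand-ins, not rows from Forbes2000.csv.

```r
# Hypothetical stand-in for the loaded data; all values are made up.
toy <- data.frame(
  sales   = c(10.2, 5.1, 7.7, 3.3),
  profits = c(1.2, NA, 0.4, NA),
  assets  = c(50.0, 20.5, 31.0, 12.9)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

The same two calls could be applied to the forbes data frame to see which columns contain the missing values.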

Analysis of relationship of profit to other variables

First I ran a correlation test comparing the profits variable with each of the other numerical variables.

cor.test(forbes$profits, forbes$rank)

## 
##  Pearson's product-moment correlation
## 
## data:  forbes$profits and forbes$rank
## t = -14.267, df = 1993, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3436994 -0.2640466
## sample estimates:
##        cor 
## -0.3044051

cor.test(forbes$profits, forbes$sales)

## 
##  Pearson's product-moment correlation
## 
## data:  forbes$profits and forbes$sales
## t = 19.732, df = 1993, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3668905 0.4403406
## sample estimates:
##       cor 
## 0.4042672

cor.test(forbes$profits, forbes$assets)

## 
##  Pearson's product-moment correlation
## 
## data:  forbes$profits and forbes$assets
## t = 10.278, df = 1993, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1822660 0.2656277
## sample estimates:
##       cor 
## 0.2243573

cor.test(forbes$profits, forbes$marketvalue)

## 
##  Pearson's product-moment correlation
## 
## data:  forbes$profits and forbes$marketvalue
## t = 29.187, df = 1993, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5157195 0.5772434
## sample estimates:
##       cor 
## 0.5472202
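The t values printed by cor.test follow from the correlation and the degrees of freedom through t = r * sqrt(df) / sqrt(1 - r^2). As a quick check, the sketch below recomputes them from the numbers printed above; the variable names `r`, `df`, and `t_stat` are my own.

```r
# Correlations and degrees of freedom copied from the cor.test output above
r <- c(rank = -0.3044051, sales = 0.4042672,
       assets = 0.2243573, marketvalue = 0.5472202)
df <- 1993

# t statistic for a Pearson correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 3))
```

Because the degrees of freedom are the same in all four tests, the ordering of the t values is the same as the ordering of the absolute correlations.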

Next I used ggplot2 to create a plot of each of the numerical variables against profit to see each of the relationships.

It is interesting to see how quickly profit drops off as the rank number increases, although many highly ranked companies also reported negative profits for the year.

ggplot(forbes,aes(x=rank, y=profits))+geom_point()+ggtitle("Forbes Profits versus Rank")+xlab("Forbes Ranking")+ylab("Annual Profits ($B)")

ggplot(forbes,aes(x=sales, y=profits))+geom_point()+ggtitle("Forbes Profits versus Sales")+xlab("Sales ($B)")+ylab("Annual Profits ($B)")

ggplot(forbes,aes(x=assets, y=profits))+geom_point()+ggtitle("Forbes Profits versus Assets")+xlab("Assets ($B)")+ylab("Annual Profits ($B)")

ggplot(forbes,aes(x=marketvalue, y=profits))+geom_point()+ggtitle("Forbes Profits versus Market Value")+xlab("Market Value ($B)")+ylab("Annual Profits ($B)")

Next, we grouped the companies by category and sorted the categories by average profit to find the categories with the highest average profits.

forbes %>%
  group_by(category)%>%
  summarise(avgProfit=mean(profits)) %>%
  arrange(desc(avgProfit)) %>% head()

## # A tibble: 6 × 2
##   category              avgProfit
##   <chr>                     <dbl>
## 1 Drugs & biotechnology     1.45 
## 2 Oil & gas operations      1.31 
## 3 Conglomerates             1.01 
## 4 Food drink & tobacco      0.594
## 5 Software & services       0.568
## 6 Consumer durables         0.566

paste("The company types that have the highest profits are drugs & biotechnology, oil & gas operations, and conglomerates")

## [1] "The company types that have the highest profits are drugs & biotechnology, oil & gas operations, and conglomerates"
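One caveat with the grouped averages: mean() returns NA for any group that contains a missing value, and arrange() places NA rows last in both ascending and descending order, so a category with a missing profit would not show up in either head(). The sketch below illustrates this on made-up data in base R (chosen so it runs without extra packages); `toy` and its values are hypothetical.

```r
# Hypothetical data: category "B" has one missing profit value.
toy <- data.frame(
  category = c("A", "A", "B", "B", "C", "C"),
  profits  = c(1.0, 3.0, 0.5, NA, 0.2, 0.4)
)

# Without na.rm, any group containing an NA gets an NA mean
avg_raw <- tapply(toy$profits, toy$category, mean)

# With na.rm = TRUE, the mean uses only the non-missing values
avg_clean <- tapply(toy$profits, toy$category, mean, na.rm = TRUE)

print(avg_raw)
print(sort(avg_clean, decreasing = TRUE))
```

In the dplyr pipeline, the equivalent change would be mean(profits, na.rm = TRUE) inside summarise().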

Next, we did the same analysis but reversed the order of the categories to show the company categories with the lowest average profit.

forbes %>%
  group_by(category)%>%
  summarise(avgProfit=mean(profits)) %>%
  arrange(avgProfit) %>% head()

## # A tibble: 6 × 2
##   category                        avgProfit
##   <chr>                               <dbl>
## 1 Trading companies                  0.028 
## 2 Capital goods                      0.0955
## 3 Business services & supplies       0.171 
## 4 Materials                          0.196 
## 5 Construction                       0.198 
## 6 Technology hardware & equipment    0.206

paste("The company types that have the lowest profits are trading companies, capital goods, and business services & supplies")

## [1] "The company types that have the lowest profits are trading companies, capital goods, and business services & supplies"

Next, we shifted from analyzing using the company type and instead used the company country to show the countries that had the highest average profit.

One of the main limitations of this breakdown is that multinational companies are listed under a combined country label, so their average is calculated separately from the companies in either of the countries where the company is based.

forbes %>%
  group_by(country)%>%
  summarise(avgProfit=mean(profits)) %>%
  arrange(desc(avgProfit)) %>% head()

## # A tibble: 6 × 2
##   country                     avgProfit
##   <chr>                           <dbl>
## 1 Netherlands/ United Kingdom      5.32
## 2 United Kingdom/ Australia        1.64
## 3 Russia                           1.24
## 4 Kong/China                       1.19
## 5 Panama/ United Kingdom           1.18
## 6 Australia/ United Kingdom        1.18

paste("The countries that generate the most profits are the Netherlands/UK, UK/Australia, and Russia")

## [1] "The countries that generate the most profits are the Netherlands/UK, UK/Australia, and Russia"

The next analysis shows the countries with the lowest average profit. It has the same limitation as the previous country analysis, and the limitation is clearer here given how much the average profit varies among the multinational companies partly based in the UK.

forbes %>%
  group_by(country)%>%
  summarise(avgProfit=mean(profits)) %>%
  arrange(avgProfit) %>% head()

## # A tibble: 6 × 2
##   country                      avgProfit
##   <chr>                            <dbl>
## 1 France/ United Kingdom        -2.83   
## 2 Luxembourg                    -0.125  
## 3 United Kingdom/ South Africa  -0.1    
## 4 Netherlands                   -0.0389 
## 5 Germany                       -0.0382 
## 6 Africa                        -0.00500

paste("The countries that generate the least profits are France/UK, Luxembourg, and UK/ South Africa")

## [1] "The countries that generate the least profits are France/UK, Luxembourg, and UK/ South Africa"
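Since a combined country label may cover very few companies, it can help to report the group size next to the average. This is a sketch on made-up data in base R; the label "X/Y", the values, and the cutoff `min_n` are all hypothetical choices for illustration.

```r
# Hypothetical data: "X/Y" stands in for a multinational label with one company.
toy <- data.frame(
  country = c("X", "X", "X", "Y", "Y", "Y", "X/Y"),
  profits = c(0.5, 0.7, 0.9, 0.2, 0.3, 0.4, 5.0)
)

# Mean profit and number of companies per country label
avg <- tapply(toy$profits, toy$country, mean)
n   <- tapply(toy$profits, toy$country, length)
by_country <- data.frame(country   = names(avg),
                         avgProfit = as.vector(avg),
                         n         = as.vector(n))
print(by_country)

# Keep only labels with at least min_n companies (arbitrary cutoff)
min_n  <- 3
stable <- by_country[by_country$n >= min_n, ]
stable <- stable[order(-stable$avgProfit), ]
print(stable)
```

In this toy example the single-company label has the highest average but is removed once a minimum group size is required.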

Next, I repeated the above analyses using average sales instead of average profit.

forbes %>%
  group_by(category)%>%
  summarise(avgSales=mean(sales)) %>%
  arrange(desc(avgSales)) %>% head()

## # A tibble: 6 × 2
##   category             avgSales
##   <chr>                   <dbl>
## 1 Trading companies        29.1
## 2 Consumer durables        24.1
## 3 Food markets             21.3
## 4 Oil & gas operations     19.5
## 5 Conglomerates            16.0
## 6 Aerospace & defense      14.4

paste("The company types that have the highest sales are trading companies, consumer durables, and food markets.")

## [1] "The company types that have the highest sales are trading companies, consumer durables, and food markets."

forbes %>%
  group_by(category)%>%
  summarise(avgSales=mean(sales)) %>%
  arrange(avgSales) %>% head()

## # A tibble: 6 × 2
##   category                     avgSales
##   <chr>                           <dbl>
## 1 Banking                          5.31
## 2 Software & services              5.38
## 3 Hotels restaurants & leisure     5.70
## 4 Diversified financials           5.74
## 5 Semiconductors                   5.90
## 6 Media                            6.35

paste("The company types that have the lowest sales are banking, software & services, and hotels restaurants & leisure")

## [1] "The company types that have the lowest sales are banking, software & services, and hotels restaurants & leisure"

forbes %>%
  group_by(country)%>%
  summarise(avgSales=mean(sales)) %>%
  arrange(desc(avgSales)) %>% head()

## # A tibble: 6 × 2
##   country                     avgSales
##   <chr>                          <dbl>
## 1 Netherlands/ United Kingdom     92.1
## 2 Germany                         20.8
## 3 France                          20.1
## 4 Netherlands                     17.0
## 5 Korea                           15.0
## 6 Luxembourg                      14.2

paste("The countries that generate the most sales are the Netherlands/UK, Germany, and France.")

## [1] "The countries that generate the most sales are the Netherlands/UK, Germany, and France."

forbes %>%
  group_by(country)%>%
  summarise(avgSales=mean(sales)) %>%
  arrange(avgSales) %>% head()

## # A tibble: 6 × 2
##   country                avgSales
##   <chr>                     <dbl>
## 1 Peru                       0.17
## 2 Venezuela                  0.98
## 3 France/ United Kingdom     1.01
## 4 Pakistan                   1.23
## 5 Jordan                     1.33
## 6 Bahamas                    1.35

paste("The countries that generate the lowest sales are Peru, Venezuela, and France/UK.")

## [1] "The countries that generate the lowest sales are Peru, Venezuela, and France/UK."
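The eight rankings above all follow the same pattern: group, average, sort, and take the first few rows. A small helper could express that pattern once. This is a base R sketch; the function name `rank_groups`, its arguments, and the toy data are my own and not part of the original analysis.

```r
# Rank groups by the mean of a numeric column.
# `data`, `group`, and `value` are generic names, not tied to the report's data.
rank_groups <- function(data, group, value, n = 6, decreasing = TRUE) {
  avg <- tapply(data[[value]], data[[group]], mean, na.rm = TRUE)
  head(sort(avg, decreasing = decreasing), n)
}

# Hypothetical data for illustration only
toy <- data.frame(
  category = c("A", "A", "B", "B", "C", "C"),
  sales    = c(10, 20, 5, 7, 30, 40),
  profits  = c(1, 3, 0.5, 0.7, 0.2, 0.4)
)

top_sales    <- rank_groups(toy, "category", "sales", n = 2)
bottom_sales <- rank_groups(toy, "category", "sales", n = 2, decreasing = FALSE)
print(top_sales)
print(bottom_sales)
```

Note that this helper uses na.rm = TRUE, so its results would differ from the pipelines above for any group that contains missing values.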

Comparison of the USA and Japan

The next set of analyses compares the United States and Japan. To do this, we first created a subset of the main Forbes2000 data set for each country.

JPN = subset(forbes, country == "Japan")
USA = subset(forbes, country == "United States")
str(JPN)

## 'data.frame':    316 obs. of  9 variables:
##  $ X          : int  8 30 49 51 72 82 138 143 156 157 ...
##  $ rank       : int  8 30 49 51 72 82 138 143 156 157 ...
##  $ name       : chr  "Toyota Motor" "Nippon Tel & Tel" "Honda Motor" "Nissan Motor" ...
##  $ country    : chr  "Japan" "Japan" "Japan" "Japan" ...
##  $ category   : chr  "Consumer durables" "Telecommunications services" "Consumer durables" "Consumer durables" ...
##  $ sales      : num  135.8 92.4 67.4 57.8 41.6 ...
##  $ profits    : num  7.99 2.17 3.61 4.19 1.4 0.98 1.61 0.83 0.46 0.68 ...
##  $ assets     : num  171.7 150.9 63.1 60.6 116.7 ...
##  $ marketvalue: num  115.4 73 40.6 41.7 30.6 ...

str(USA)

## 'data.frame':    751 obs. of  9 variables:
##  $ X          : int  1 2 3 4 6 9 10 14 15 16 ...
##  $ rank       : int  1 2 3 4 6 9 10 14 15 16 ...
##  $ name       : chr  "Citigroup" "General Electric" "American Intl Group" "ExxonMobil" ...
##  $ country    : chr  "United States" "United States" "United States" "United States" ...
##  $ category   : chr  "Banking" "Conglomerates" "Insurance" "Oil & gas operations" ...
##  $ sales      : num  94.7 134.2 76.7 222.9 49 ...
##  $ profits    : num  17.85 15.59 6.46 20.96 10.81 ...
##  $ assets     : num  1264 627 648 167 736 ...
##  $ marketvalue: num  255 329 195 277 118 ...

The first analysis was to find what the highest ranked company was for each country. Because highest rank corresponds with the lowest number, I used the min function.

min(JPN$rank)

## [1] 8

min(USA$rank)

## [1] 1

paste("The highest ranked company using Forbes from the United States is ranked 1st, while the highest ranked company from Japan is ranked 8th.")

## [1] "The highest ranked company using Forbes from the United States is ranked 1st, while the highest ranked company from Japan is ranked 8th."
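min() gives the best rank but not the company that holds it. which.min() returns the position of the smallest value, which can then be used to look up the name. The sketch below uses only the first three rows of each subset as printed by str() above; the data frame names `jpn_head` and `usa_head` are my own.

```r
# First three rows of each subset, copied from the str() output above
jpn_head <- data.frame(
  rank = c(8, 30, 49),
  name = c("Toyota Motor", "Nippon Tel & Tel", "Honda Motor")
)
usa_head <- data.frame(
  rank = c(1, 2, 3),
  name = c("Citigroup", "General Electric", "American Intl Group")
)

# Position of the smallest rank, then the matching company name
jpn_top <- as.character(jpn_head$name[which.min(jpn_head$rank)])
usa_top <- as.character(usa_head$name[which.min(usa_head$rank)])
print(jpn_top)
print(usa_top)
```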

The next analysis was to find the most common type of company in each of the countries. To do this, I used the table function to find the number of observations within each of the country subsets that fit each company type, then sorted the categories in descending order to show the most common company type.

USA.table <- table(USA$category)
USA.table1 <- sort(desc(USA.table))
paste("United States")

## [1] "United States"

head(abs(USA.table1))

## 
##                          Banking           Diversified financials 
##                               83                               60 
##                        Utilities Health care equipment & services 
##                               54                               53 
##                        Retailing                        Insurance 
##                               53                               46

JPN.table <-table(JPN$category)
JPN.table1 <- sort(desc(JPN.table))
paste("Japan")

## [1] "Japan"

head(abs(JPN.table1))

## 
##                Banking Diversified financials      Consumer durables 
##                     69                     24                     22 
##         Transportation          Capital goods           Construction 
##                     20                     19                     18

paste("The most common company type in both countries is banking. In the United States this is followed by diversified financials and utilities. In Japan this is followed by diversified financials and consumer durables.")

## [1] "The most common company type in both countries is banking. In the United States this is followed by diversified financials and utilities. In Japan this is followed by diversified financials and consumer durables."
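The counts above were sorted by negating the table with desc() and then undoing the sign with abs(). A more direct route is the decreasing argument of sort(). This sketch uses a made-up vector of labels; `cats` is hypothetical.

```r
# Hypothetical category labels for illustration
cats <- c("Banking", "Utilities", "Banking", "Retailing", "Banking", "Utilities")

# Count each label and sort the counts from largest to smallest
counts <- sort(table(cats), decreasing = TRUE)
print(head(counts, 2))

# Name of the most common label
most_common <- names(counts)[1]
print(most_common)
```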

Next I created several plots to compare how different variables relate to profits in the two countries. When comparing profits to rank, Japanese companies tend to make lower profits than American companies at similar ranks. To emphasize this, the y-axis of the first plot uses the log of profits.

combine.sub <- subset(forbes, country =="United States"| country =="Japan")
ggplot(combine.sub, aes(x=rank,y=log(profits), col=country))+geom_point(alpha=0.4)+geom_smooth()+ggtitle("American and Japanese Profits versus Forbes Rank")+xlab("Forbes Ranking")+ylab("Log (Annual Profits ($B))")

ggplot(combine.sub, aes(x=sales,y=profits, col=country))+geom_point()+ggtitle("American and Japanese Profits versus Sales")+xlab("Sales ($B)")+ylab("Annual Profits ($B)")

ggplot(combine.sub, aes(x=marketvalue,y=profits, col=country))+geom_point()+ggtitle("American and Japanese Profits versus Market Value")+xlab("Market Value ($B)")+ylab("Annual Profits ($B)")
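A side effect of plotting log(profits) is that companies with zero or negative profits cannot be shown, because the logarithm is not finite there. The sketch below demonstrates this on made-up values and shows one possible alternative, a signed log transform. That alternative is my own suggestion and was not used in the report.

```r
# Hypothetical profit values, including a loss and a zero
profits <- c(-2.5, 0, 0.4, 3.0)

# log() is NaN for negative values and -Inf at zero,
# so these points would be missing from a plot of log(profits)
log_profits <- suppressWarnings(log(profits))
print(log_profits)

# One possible alternative (an assumption, not used in the report):
# a signed log that keeps losses on the plot
signed_log <- sign(profits) * log1p(abs(profits))
print(signed_log)
```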

Build a linear model to estimate profits

For the next analysis I built a linear model. I first centered the assets, market value, and sales variables by subtracting each variable's mean.

assets.cen = forbes$assets-mean(forbes$assets)
mv.cen = forbes$marketvalue-mean(forbes$marketvalue)
sales.cen = forbes$sales-mean(forbes$sales)
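Centering the predictors changes the intercept but not the slopes. The sketch below checks this on simulated data; the variables `x1`, `x2`, and `y` and their coefficients are made up for illustration.

```r
set.seed(1)  # reproducible made-up data
n  <- 50
x1 <- rnorm(n, mean = 20, sd = 5)
x2 <- rnorm(n, mean = 10, sd = 2)
y  <- 1 + 0.3 * x1 - 0.2 * x2 + rnorm(n)

# Same model with raw and with mean-centered predictors
fit_raw <- lm(y ~ x1 + x2)
x1_cen  <- x1 - mean(x1)
x2_cen  <- x2 - mean(x2)
fit_cen <- lm(y ~ x1_cen + x2_cen)

# Slopes are unchanged; only the intercept moves
print(coef(fit_raw))
print(coef(fit_cen))

# With centered predictors and no missing data, the intercept is the mean of y
print(mean(y))
```

With centered predictors, the intercept can be read as the expected response for a case with average values of every predictor.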

Next, I used those centered variables to fit a linear model with profits as the dependent variable, examined the results with the summary and anova functions, and plotted the residuals.

lm.fit = lm(profits ~ assets.cen + mv.cen + sales.cen, forbes)
lm.fit

## 
## Call:
## lm(formula = profits ~ assets.cen + mv.cen + sales.cen, data = forbes)
## 
## Coefficients:
## (Intercept)   assets.cen       mv.cen    sales.cen  
##   0.3802541   -0.0008395    0.0363340    0.0098892

summary(lm.fit)

## 
## Call:
## lm(formula = profits ~ assets.cen + mv.cen + sales.cen, data = forbes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.2169  -0.0189   0.1160   0.2107   8.9495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3802541  0.0329542  11.539   <2e-16 ***
## assets.cen  -0.0008395  0.0003781  -2.220   0.0265 *  
## mv.cen       0.0363340  0.0018183  19.982   <2e-16 ***
## sales.cen    0.0098892  0.0024331   4.064    5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.472 on 1991 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3049 
## F-statistic: 292.5 on 3 and 1991 DF,  p-value: < 2.2e-16

anova(lm.fit)

## Analysis of Variance Table
## 
## Response: profits
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## assets.cen    1  312.8  312.84  144.40 < 2.2e-16 ***
## mv.cen        1 1552.8 1552.77  716.71 < 2.2e-16 ***
## sales.cen     1   35.8   35.79   16.52 5.001e-05 ***
## Residuals  1991 4313.6    2.17                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
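When reading the table above, note that anova() on a single lm fit reports sequential sums of squares, so the amount attributed to each variable depends on the order in which the variables enter the formula. This sketch shows the effect on simulated, correlated predictors; all names and values are made up.

```r
set.seed(2)  # reproducible made-up data
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # x2 is correlated with x1
y  <- x1 + x2 + rnorm(n)

# Same predictors, entered in a different order
aov_12 <- anova(lm(y ~ x1 + x2))
aov_21 <- anova(lm(y ~ x2 + x1))

# Sum of squares attributed to x1 when it enters first versus last
ss_x1_first <- aov_12["x1", "Sum Sq"]
ss_x1_last  <- aov_21["x1", "Sum Sq"]
print(c(first = ss_x1_first, last = ss_x1_last))

# The residual sum of squares does not depend on the order
print(c(aov_12["Residuals", "Sum Sq"], aov_21["Residuals", "Sum Sq"]))
```

For this reason, the t values in summary(), which do not depend on the order of the terms, are often a safer guide when comparing predictors.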

plot(lm.fit, which = 1)

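paste("The variable that appears to have the greatest effect on profits is market value, which has the largest coefficient and t value in the model summary")

## [1] "The variable that appears to have the greatest effect on profits is market value, which has the largest coefficient and t value in the model summary"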

paste("The residuals are clustered near 0 at low fitted values and show more variation at higher fitted values.")

## [1] "The residuals are clustered near 0 at low fitted values and show more variation at higher fitted values."

Build a model for Japanese and American companies

The final analysis fits a model like the one above separately for the Japanese and American companies, using the data subsets created earlier. First I created a model for the Japanese companies.

JPN.assets.cen = JPN$assets-mean(JPN$assets)
JPN.mv.cen = JPN$marketvalue-mean(JPN$marketvalue)
JPN.sales.cen = JPN$sales-mean(JPN$sales)
JPN.fit = lm(profits ~ JPN.assets.cen + JPN.mv.cen + JPN.sales.cen, JPN)
summary(JPN.fit)

## 
## Call:
## lm(formula = profits ~ JPN.assets.cen + JPN.mv.cen + JPN.sales.cen, 
##     data = JPN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2316 -0.1651  0.0184  0.2160  5.3909 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.0223734  0.0456102   0.491    0.624    
## JPN.assets.cen -0.0126813  0.0004999 -25.368   <2e-16 ***
## JPN.mv.cen      0.0765111  0.0062517  12.239   <2e-16 ***
## JPN.sales.cen  -0.0006585  0.0034725  -0.190    0.850    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8108 on 312 degrees of freedom
## Multiple R-squared:  0.6857, Adjusted R-squared:  0.6827 
## F-statistic: 226.9 on 3 and 312 DF,  p-value: < 2.2e-16

anova(JPN.fit)

## Analysis of Variance Table
## 
## Response: profits
##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## JPN.assets.cen   1 284.846 284.846 433.311 <2e-16 ***
## JPN.mv.cen       1 162.574 162.574 247.310 <2e-16 ***
## JPN.sales.cen    1   0.024   0.024   0.036 0.8497    
## Residuals      312 205.100   0.657                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Then I ran the same analysis for the American companies.

USA.assets.cen = USA$assets-mean(USA$assets)
USA.mv.cen = USA$marketvalue-mean(USA$marketvalue)
USA.sales.cen = USA$sales-mean(USA$sales)
USA.fit = lm(profits ~ USA.assets.cen + USA.mv.cen + USA.sales.cen, USA)
summary(USA.fit)

## 
## Call:
## lm(formula = profits ~ USA.assets.cen + USA.mv.cen + USA.sales.cen, 
##     data = USA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4954 -0.0620  0.1217  0.1953  7.8811 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.6491464  0.0296828  21.869  < 2e-16 ***
## USA.assets.cen 0.0043831  0.0003655  11.993  < 2e-16 ***
## USA.mv.cen     0.0339639  0.0012747  26.645  < 2e-16 ***
## USA.sales.cen  0.0138408  0.0020376   6.793 2.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8118 on 744 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.7942 
## F-statistic: 961.8 on 3 and 744 DF,  p-value: < 2.2e-16

anova(USA.fit)

## Analysis of Variance Table
## 
## Response: profits
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## USA.assets.cen   1 975.23  975.23 1479.783 < 2.2e-16 ***
## USA.mv.cen       1 895.96  895.96 1359.498 < 2.2e-16 ***
## USA.sales.cen    1  30.41   30.41   46.142 2.253e-11 ***
## Residuals      744 490.32    0.66                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, I compared the two models against each other to see how the countries differ beyond the differences shown in the head-to-head comparison above.

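paste("In Japan the most important variable is assets, which has the largest t value and sum of squares; sales is not significant (p = 0.85)")

## [1] "In Japan the most important variable is assets, which has the largest t value and sum of squares; sales is not significant (p = 0.85)"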

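paste("In the United States all three variables are significant; market value has the largest t value and company assets has the largest sequential sum of squares")

## [1] "In the United States all three variables are significant; market value has the largest t value and company assets has the largest sequential sum of squares"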

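paste("The residual sum of squares for Japan (205.1) is lower than for the United States (490.3), but Japan also has fewer companies, so this does not show a better fit. The R-squared values (0.686 for Japan, 0.795 for the United States) suggest the model fits the United States better.")

## [1] "The residual sum of squares for Japan (205.1) is lower than for the United States (490.3), but Japan also has fewer companies, so this does not show a better fit. The R-squared values (0.686 for Japan, 0.795 for the United States) suggest the model fits the United States better."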

paste("The United States has a larger F-statistic (961.8 compared to 226.9), which reflects both the larger number of American companies in the ranking and the higher R-squared of the American model.")

## [1] "The United States has a larger F-statistic (961.8 compared to 226.9), which reflects both the larger number of American companies in the ranking and the higher R-squared of the American model."

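paste("The residual mean square is very similar between the two countries (0.657 for Japan, 0.66 for the United States) because it divides the residual sum of squares by the residual degrees of freedom, which accounts for the larger sample size in the United States.")

## [1] "The residual mean square is very similar between the two countries (0.657 for Japan, 0.66 for the United States) because it divides the residual sum of squares by the residual degrees of freedom, which accounts for the larger sample size in the United States."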
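The fit statistics quoted above can be recomputed from the two ANOVA tables, which makes it easier to see how sample size enters each one. The sketch below copies the sums of squares and degrees of freedom from the printed output; the variable names are my own.

```r
# Sums of squares copied from the two ANOVA tables above
ss_model <- c(JPN = 284.846 + 162.574 + 0.024,
              USA = 975.23 + 895.96 + 30.41)
ss_resid <- c(JPN = 205.100, USA = 490.32)
df_model <- 3
df_resid <- c(JPN = 312, USA = 744)

# R-squared: share of the total sum of squares explained by the model
r2 <- ss_model / (ss_model + ss_resid)

# Residual mean square and overall F-statistic
ms_resid <- ss_resid / df_resid
f_stat   <- (ss_model / df_model) / ms_resid

print(round(r2, 4))
print(round(ms_resid, 3))
print(round(f_stat, 1))
```

The residual sum of squares grows with the number of companies, while R-squared and the residual mean square are scaled and so are easier to compare between the two countries.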