forb <- read.csv("Forbes2000.csv")Project 2
Project 2: Market Prices
Reading in the csv file
Question 1
Using this file, explore the relationship of profits to the other variables. Which company types generate the highest (and lowest) profits? Which countries generate the most profits? What about sales?
To look at high and low profits and sales, I use the average and the total for each of these categories.
head(forb) X rank name country category sales profits
1 1 1 Citigroup United States Banking 94.71 17.85
2 2 2 General Electric United States Conglomerates 134.19 15.59
3 3 3 American Intl Group United States Insurance 76.66 6.46
4 4 4 ExxonMobil United States Oil & gas operations 222.88 20.96
5 5 5 BP United Kingdom Oil & gas operations 232.57 10.27
6 6 6 Bank of America United States Banking 49.01 10.81
assets marketvalue
1 1264.03 255.30
2 626.93 328.54
3 647.66 194.87
4 166.99 277.02
5 177.57 173.54
6 736.45 117.55
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Looking at profits
First, I looked at the average profit by company type (using the group_by() function) and then arranged it first in descending order, then ascending order to get the max and min. I did the same afterwards for total profit for company type. After this, I use the same process for country, but I group the data by country instead of company type.
forb |>
group_by(category) |>
summarise(
avgprofit = mean(profits)
) |>
arrange(desc(avgprofit))# A tibble: 27 × 2
category avgprofit
<chr> <dbl>
1 Drugs & biotechnology 1.45
2 Oil & gas operations 1.31
3 Conglomerates 1.01
4 Food drink & tobacco 0.594
5 Software & services 0.568
6 Consumer durables 0.566
7 Household & personal products 0.550
8 Diversified financials 0.500
9 Retailing 0.476
10 Semiconductors 0.437
# ℹ 17 more rows
forb |>
group_by(category) |>
summarise(
avgprofit = mean(profits)
) |>
arrange(avgprofit)# A tibble: 27 × 2
category avgprofit
<chr> <dbl>
1 Trading companies 0.028
2 Capital goods 0.0955
3 Business services & supplies 0.171
4 Materials 0.196
5 Construction 0.198
6 Technology hardware & equipment 0.206
7 Media 0.211
8 Utilities 0.211
9 Food markets 0.249
10 Hotels restaurants & leisure 0.259
# ℹ 17 more rows
On average, the drugs and biotechnology company type makes the most profit (1.45), while trading companies make the least profit on average (0.028).
forb |>
group_by(category) |>
summarise(
totalprofit = sum(profits)
) |>
arrange(desc(totalprofit))# A tibble: 27 × 2
category totalprofit
<chr> <dbl>
1 Banking 132.
2 Oil & gas operations 118.
3 Diversified financials 78.9
4 Drugs & biotechnology 65.2
5 Food drink & tobacco 49.3
6 Consumer durables 41.9
7 Retailing 41.9
8 Conglomerates 31.4
9 Household & personal products 24.2
10 Health care equipment & services 23.5
# ℹ 17 more rows
forb |>
group_by(category) |>
summarise(
totalprofit = sum(profits)
) |>
arrange(totalprofit)# A tibble: 27 × 2
category totalprofit
<chr> <dbl>
1 Trading companies 0.7
2 Capital goods 5.06
3 Aerospace & defense 5.48
4 Food markets 8.22
5 Hotels restaurants & leisure 9.57
6 Semiconductors 11.4
7 Business services & supplies 12.0
8 Technology hardware & equipment 12.1
9 Media 12.8
10 Chemicals 13.0
# ℹ 17 more rows
When we look at total profits among a company types, banking made the most total profit (132), while trading companies made the least total profit (0.7).
forb |>
group_by(country) |>
summarise(
avgprofit = mean(profits)
) |>
arrange(desc(avgprofit))# A tibble: 61 × 2
country avgprofit
<chr> <dbl>
1 Netherlands/ United Kingdom 5.32
2 United Kingdom/ Australia 1.64
3 Russia 1.24
4 Kong/China 1.19
5 Panama/ United Kingdom 1.18
6 Australia/ United Kingdom 1.18
7 Islands 0.74
8 Finland 0.641
9 Norway 0.631
10 China 0.622
# ℹ 51 more rows
forb |>
group_by(country) |>
summarise(
avgprofit = mean(profits)
) |>
arrange(avgprofit)# A tibble: 61 × 2
country avgprofit
<chr> <dbl>
1 France/ United Kingdom -2.83
2 Luxembourg -0.125
3 United Kingdom/ South Africa -0.1
4 Netherlands -0.0389
5 Germany -0.0382
6 Africa -0.00500
7 Japan 0.0224
8 Korea 0.03
9 Israel 0.0987
10 Austria 0.108
# ℹ 51 more rows
The Netherlands & UK (probably joint companies) have the highest average profit overall (5.32), while a joint company between France & the UK had the lowest average profit overall (-2.83).
forb |>
group_by(country) |>
summarise(
totalprofit = sum(profits)
) |>
arrange(desc(totalprofit))# A tibble: 61 × 2
country totalprofit
<chr> <dbl>
1 Canada 23.3
2 South Korea 15.6
3 China 15.5
4 Russia 14.9
5 Switzerland 13.8
6 Spain 11.7
7 Netherlands/ United Kingdom 10.6
8 India 9.37
9 Bermuda 9.11
10 Italy 7.52
# ℹ 51 more rows
forb |>
group_by(country) |>
summarise(
totalprofit = sum(profits)
) |>
arrange(totalprofit)# A tibble: 61 × 2
country totalprofit
<chr> <dbl>
1 France/ United Kingdom -2.83
2 Germany -2.48
3 Netherlands -1.09
4 Luxembourg -0.25
5 United Kingdom/ South Africa -0.1
6 Africa -0.0100
7 Peru 0.11
8 Venezuela 0.12
9 Korea 0.12
10 United Kingdom/ Netherlands 0.14
# ℹ 51 more rows
When looking at total profit, Canada has the highest total profit (23.3), while joint companies between France & the UK have the lowest total profit (-2.83).
Looking at sales
forb |>
group_by(category) |>
summarise(
avgsales = mean(sales)
) |>
arrange(desc(avgsales))# A tibble: 27 × 2
category avgsales
<chr> <dbl>
1 Trading companies 29.1
2 Consumer durables 24.1
3 Food markets 21.3
4 Oil & gas operations 19.5
5 Conglomerates 16.0
6 Aerospace & defense 14.4
7 Retailing 13.8
8 Telecommunications services 12.7
9 Insurance 12.0
10 Technology hardware & equipment 10.8
# ℹ 17 more rows
forb |>
group_by(category) |>
summarise(
avgsales = mean(sales)
) |>
arrange(avgsales)# A tibble: 27 × 2
category avgsales
<chr> <dbl>
1 Banking 5.31
2 Software & services 5.38
3 Hotels restaurants & leisure 5.70
4 Diversified financials 5.74
5 Semiconductors 5.90
6 Media 6.35
7 Materials 6.54
8 Business services & supplies 6.70
9 Construction 6.70
10 Household & personal products 7.19
# ℹ 17 more rows
On average, trading companies make the most sales (29.1), while banking companies make the least sales (5.31).
forb |>
group_by(category) |>
summarise(
totalsales = sum(sales)
) |>
arrange(desc(totalsales))# A tibble: 27 × 2
category totalsales
<chr> <dbl>
1 Consumer durables 1780.
2 Oil & gas operations 1755.
3 Banking 1663.
4 Insurance 1340.
5 Retailing 1211.
6 Utilities 950.
7 Diversified financials 907.
8 Telecommunications services 853.
9 Food drink & tobacco 758.
10 Trading companies 727.
# ℹ 17 more rows
forb |>
group_by(category) |>
summarise(
totalsales = sum(sales)
) |>
arrange(totalsales)# A tibble: 27 × 2
category totalsales
<chr> <dbl>
1 Semiconductors 153.
2 Software & services 167.
3 Hotels restaurants & leisure 211.
4 Aerospace & defense 274.
5 Household & personal products 316.
6 Media 387.
7 Chemicals 408.
8 Drugs & biotechnology 412.
9 Capital goods 433.
10 Business services & supplies 469.
# ℹ 17 more rows
When we look at total sales among a company types, consumer durable companies made the most total sales (1780), while semiconductor companies made the least total sales (153).
forb |>
group_by(country) |>
summarise(
avgsales = mean(sales)
) |>
arrange(desc(avgsales))# A tibble: 61 × 2
country avgsales
<chr> <dbl>
1 Netherlands/ United Kingdom 92.1
2 Germany 20.8
3 France 20.1
4 Netherlands 17.0
5 Korea 15.0
6 Luxembourg 14.2
7 Switzerland 12.5
8 Australia/ United Kingdom 11.6
9 Norway 10.8
10 United Kingdom 10.4
# ℹ 51 more rows
forb |>
group_by(country) |>
summarise(
avgsales = mean(sales)
) |>
arrange(avgsales)# A tibble: 61 × 2
country avgsales
<chr> <dbl>
1 Peru 0.17
2 Venezuela 0.98
3 France/ United Kingdom 1.01
4 Pakistan 1.23
5 Jordan 1.33
6 Bahamas 1.35
7 Philippines 1.56
8 Chile 1.60
9 Cayman Islands 1.66
10 Malaysia 1.72
# ℹ 51 more rows
The Netherlands & UK (joint companies) had the highest average sales overall (92.1), while a Peru had the lowest average sales overall (0.17).
forb |>
group_by(country) |>
summarise(
totalsales = sum(sales)
) |>
arrange(desc(totalsales))# A tibble: 61 × 2
country totalsales
<chr> <dbl>
1 United States 7554.
2 Japan 3220.
3 United Kingdom 1431.
4 Germany 1351.
5 France 1266.
6 Netherlands 477.
7 Switzerland 424.
8 Italy 419.
9 Canada 360.
10 South Korea 359.
# ℹ 51 more rows
forb |>
group_by(country) |>
summarise(
totalsales = sum(sales)
) |>
arrange(totalsales)# A tibble: 61 × 2
country totalsales
<chr> <dbl>
1 Peru 0.17
2 Venezuela 0.98
3 France/ United Kingdom 1.01
4 Pakistan 1.23
5 Jordan 1.33
6 Bahamas 1.35
7 United Kingdom/ South Africa 2.06
8 New Zealand 2.64
9 Philippines 3.13
10 Czech Republic 3.61
# ℹ 51 more rows
When looking at the total sales, the United States had the highest sales overall (7554), while Peru had the lowest total sales overall.
Question 2
Compare the USA and Japan. Which country has the highest rank using Forbes ranking? Which company types are more common in the USA? In Japan?
First, I create a new dataset for the United States and then one for Japan where I can see the data arranged by rank. Then I create a table of the company type so that I can see the count for each of these datasets and sort it.
forbusa = forb |>
filter(country == "United States") |>
arrange(rank)
head(forbusa) X rank name country category sales
1 1 1 Citigroup United States Banking 94.71
2 2 2 General Electric United States Conglomerates 134.19
3 3 3 American Intl Group United States Insurance 76.66
4 4 4 ExxonMobil United States Oil & gas operations 222.88
5 6 6 Bank of America United States Banking 49.01
6 9 9 Fannie Mae United States Diversified financials 53.13
profits assets marketvalue
1 17.85 1264.03 255.30
2 15.59 626.93 328.54
3 6.46 647.66 194.87
4 20.96 166.99 277.02
5 10.81 736.45 117.55
6 6.48 1019.17 76.84
forbjapan = forb |>
filter(country == "Japan") |>
arrange(rank)
head(forbjapan) X rank name country category sales
1 8 8 Toyota Motor Japan Consumer durables 135.82
2 30 30 Nippon Tel & Tel Japan Telecommunications services 92.41
3 49 49 Honda Motor Japan Consumer durables 67.44
4 51 51 Nissan Motor Japan Consumer durables 57.77
5 72 72 Tokyo Electric Power Japan Utilities 41.62
6 82 82 Sony Japan Consumer durables 63.23
profits assets marketvalue
1 7.99 171.71 115.40
2 2.17 150.87 73.00
3 3.61 63.09 40.61
4 4.19 60.56 41.71
5 1.40 116.68 30.63
6 0.98 68.04 38.00
The United States has the highest rank using the Forbes ranking. It has a company that is ranked 1st, while Japan’s highest ranked company is 8th.
compusa <- table(forbusa$category)
compusa <- sort(compusa)
compusa
Food markets Aerospace & defense
9 10
Capital goods Conglomerates
10 11
Chemicals Semiconductors
13 16
Telecommunications services Hotels restaurants & leisure
16 17
Transportation Construction
17 18
Household & personal products Drugs & biotechnology
20 21
Software & services Consumer durables
21 25
Materials Food drink & tobacco
26 28
Media Business services & supplies
28 31
Oil & gas operations Technology hardware & equipment
32 33
Insurance Health care equipment & services
46 53
Retailing Utilities
53 54
Diversified financials Banking
60 83
compjapan <- table(forbjapan$category)
compjapan <- sort(compjapan)
compjapan
Food markets Telecommunications services
2 2
Oil & gas operations Semiconductors
3 3
Software & services Health care equipment & services
3 4
Technology hardware & equipment Household & personal products
6 7
Media Drugs & biotechnology
7 9
Insurance Food drink & tobacco
9 10
Utilities Materials
11 12
Retailing Trading companies
12 13
Chemicals Business services & supplies
14 17
Construction Capital goods
18 19
Transportation Consumer durables
20 22
Diversified financials Banking
24 69
It looks like banking companies are the most common type of company for both the US and Japan.
Question 3
Now build a multiple linear regression model to estimate profits using assets, market value and sales. Use the summary() function to find the coefficients and goodness-of-fit of the model. Use the anova() function to identify which variable appears to have the greatest effect on profits. Remember to look at the distribution of residuals.
fit1 <- lm(profits ~ assets + marketvalue + sales, data = forb)
summary(fit1)
Call:
lm(formula = profits ~ assets + marketvalue + sales, data = forb)
Residuals:
Min 1Q Median 3Q Max
-29.2169 -0.0189 0.1160 0.2107 8.9495
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1186259 0.0380039 -3.121 0.00183 **
assets -0.0008395 0.0003781 -2.220 0.02651 *
marketvalue 0.0363340 0.0018183 19.982 < 2e-16 ***
sales 0.0098892 0.0024331 4.064 5e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.472 on 1991 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.3059, Adjusted R-squared: 0.3049
F-statistic: 292.5 on 3 and 1991 DF, p-value: < 2.2e-16
The coefficients for the multiple linear regression model are the following:
Assets (-0.0008395): As assets held by a company increase by $1 million, profits decrease by $839.5. This coefficient has a p-value of 0.02651, which means that it is only statistically significant at the 90% confidence level.
Market Value (0.0363340): As the market value of a company increases by $1 million, the profits of the company increase by $36,334. This coefficient has a p-value of < 0.001, so it is statistically significant at the 99% confidence level.
Sales (0.0098892): As the sales of a company increase by $1 million, profits increase by $9,889.20. This coefficient also has a p-value of < 0.001, so it is statistically significant at the 99% confidence level.
The adjusted R-squared for this model is 0.3049, which means that it explains about 30.49% of the variance in the data.
anova(fit1)Analysis of Variance Table
Response: profits
Df Sum Sq Mean Sq F value Pr(>F)
assets 1 312.8 312.84 144.40 < 2.2e-16 ***
marketvalue 1 1552.8 1552.77 716.71 < 2.2e-16 ***
sales 1 35.8 35.79 16.52 5.001e-05 ***
Residuals 1991 4313.6 2.17
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the anova(), market value seems to have the biggest effect on profits because it has the highest value for the sum sq and also the highest F-value. It is also statistically significant at the 99% confidence level with a p-value < 0.001.
hist(residuals(fit1))plot(fit1, which = 1)The line on the residual plot is fairly flat, which means that we may have a linear relationship between the variables that we chose; however, they do seem to generally be clustered in one area.
Question 4
Build two models using the same variables but for Japanese and American companies. Use the anova() function to look at which variables are the most important for each region? What differences do you observe?
fitusa <- lm(profits ~ assets + marketvalue + sales, data = forbusa)
summary(fitusa)
Call:
lm(formula = profits ~ assets + marketvalue + sales, data = forbusa)
Residuals:
Min 1Q Median 3Q Max
-4.4954 -0.0620 0.1217 0.1953 7.8811
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1465337 0.0336610 -4.353 1.53e-05 ***
assets 0.0043831 0.0003655 11.993 < 2e-16 ***
marketvalue 0.0339639 0.0012747 26.645 < 2e-16 ***
sales 0.0138408 0.0020376 6.793 2.25e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8118 on 744 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.795, Adjusted R-squared: 0.7942
F-statistic: 961.8 on 3 and 744 DF, p-value: < 2.2e-16
The coefficients for the multiple linear regression model of the US are the following:
Assets (0.0043831): As assets held by a company increase by $1 million, profits increase by $4383.10.
Market Value (0.0339639): As the market value of a company increases by $1 million, the profits of the company increase by $33,963.90.
Sales (0.0138408): As the sales of a company increase by $1 million, profits increase by $13,840.80.
Each of these coefficients has a p-value of < 0.001, so they are statistically significant at the 99% confidence level.
The adjusted R-squared for this model is 0.7942, which means that it explains about 79.42% of the variance in the data (more than the model that included all countries).
anova(fitusa)Analysis of Variance Table
Response: profits
Df Sum Sq Mean Sq F value Pr(>F)
assets 1 975.23 975.23 1479.783 < 2.2e-16 ***
marketvalue 1 895.96 895.96 1359.498 < 2.2e-16 ***
sales 1 30.41 30.41 46.142 2.253e-11 ***
Residuals 744 490.32 0.66
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the anova(), assets seems to have the biggest effect on profits, closely followed by market value. These results are both statistically significant at the 99% confidence level with a p-value < 0.001.
fitjapan <- lm(profits ~ assets + marketvalue + sales, data = forbjapan)
summary(fitjapan)
Call:
lm(formula = profits ~ assets + marketvalue + sales, data = forbjapan)
Residuals:
Min 1Q Median 3Q Max
-8.2316 -0.1651 0.0184 0.2160 5.3909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0731901 0.0551181 -1.328 0.185
assets -0.0126813 0.0004999 -25.368 <2e-16 ***
marketvalue 0.0765111 0.0062517 12.239 <2e-16 ***
sales -0.0006585 0.0034725 -0.190 0.850
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8108 on 312 degrees of freedom
Multiple R-squared: 0.6857, Adjusted R-squared: 0.6827
F-statistic: 226.9 on 3 and 312 DF, p-value: < 2.2e-16
The coefficients for the multiple linear regression model of Japan are the following:
Assets (-0.0128613): As assets held by a company increase by $1 million, profits decrease by $12,861.30.
Market Value (0.0765111): As the market value of a company increases by $1 million, the profits of the company increase by $76,511.10.
Sales (-0.0006585): As the sales of a company increase by $1 million, profits decrease by $658.50.
Only the assets and market value coefficients have a p-value of < 0.001, so they are statistically significant at the 99% confidence level. The sales coefficient is not statistically significant.
The adjusted R-squared for this model is 0.6827, which means that it explains about 68.27% of the variance in the data (more than the model that included all countries, but less than the model for the US).
anova(fitjapan)Analysis of Variance Table
Response: profits
Df Sum Sq Mean Sq F value Pr(>F)
assets 1 284.846 284.846 433.311 <2e-16 ***
marketvalue 1 162.574 162.574 247.310 <2e-16 ***
sales 1 0.024 0.024 0.036 0.8497
Residuals 312 205.100 0.657
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the anova(), assets seems to have the biggest effect on profits. These results are statistically significant at the 99% confidence level with a p-value < 0.001.
There are some slight differences in the results of the model for the US and Japan. First, the sales coefficient was not significant for Japan, but it was significant for the US. Additionally, the assets coefficient for Japan was negative, while it was positive for the US. Finally, the adjusted R-squared was slightly higher for the US compared to Japan, meaning that the model is slightly better at explaining the variance in the data for the US.