project02

Author

Hangu Lee

1. Data Exploration

# Load the dataset
forbes <- read.csv("Forbes2000.csv")

# Check the structure and summarized values
str(forbes)

'data.frame':   2000 obs. of  9 variables:
 $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ name       : chr  "Citigroup" "General Electric" "American Intl Group" "ExxonMobil" ...
 $ country    : chr  "United States" "United States" "United States" "United States" ...
 $ category   : chr  "Banking" "Conglomerates" "Insurance" "Oil & gas operations" ...
 $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
 $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
 $ assets     : num  1264 627 648 167 178 ...
 $ marketvalue: num  255 329 195 277 174 ...

summary(forbes)

Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 32 is
invalid UTF-8

Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 54 is
invalid UTF-8

Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 68 is
invalid UTF-8

Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 186 is
invalid UTF-8

Warning in grep("^[ \t\r\n]*$", object, perl = TRUE): input string 278 is
invalid UTF-8

       X               rank               name           country    
 Min.   :   1.0   Min.   :   1.0   Length   :2000   Length   :2000  
 1st Qu.: 500.8   1st Qu.: 500.8   N.unique :2000   N.unique :  61  
 Median :1000.5   Median :1000.5   N.blank  :   0   N.blank  :   0  
 Mean   :1000.5   Mean   :1000.5   Min.nchar:  NA   Min.nchar:   4  
 3rd Qu.:1500.2   3rd Qu.:1500.2   Max.nchar:  NA   Max.nchar:  28  
 Max.   :2000.0   Max.   :2000.0                                    
                                                                    
      category        sales            profits             assets        
 Length   :2000   Min.   :  0.010   Min.   :-25.8300   Min.   :   0.270  
 N.unique :  27   1st Qu.:  2.018   1st Qu.:  0.0800   1st Qu.:   4.025  
 N.blank  :   0   Median :  4.365   Median :  0.2000   Median :   9.345  
 Min.nchar:   5   Mean   :  9.697   Mean   :  0.3811   Mean   :  34.042  
 Max.nchar:  32   3rd Qu.:  9.547   3rd Qu.:  0.4400   3rd Qu.:  22.793  
                  Max.   :256.330   Max.   : 20.9600   Max.   :1264.030  
                                    NAs    :5                            
  marketvalue    
 Min.   :  0.02  
 1st Qu.:  2.72  
 Median :  5.15  
 Mean   : 11.88  
 3rd Qu.: 10.60  
 Max.   :328.54
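The invalid UTF-8 warnings above suggest that some company names contain bytes from another encoding, which is also why Min.nchar and Max.nchar are NA for name. A minimal sketch of how such strings could be located and repaired follows. The two names are hypothetical, and Latin-1 is an assumption about the source encoding, not something the output confirms.

```r
# Hypothetical names for illustration; "\xe9" is a Latin-1 byte that is not valid UTF-8
company_names <- c("Citigroup", "Nestl\xe9")

# Flag the entries that are not valid UTF-8
bad <- !validUTF8(company_names)

# Re-encode, ASSUMING the original bytes are Latin-1
fixed_names <- iconv(company_names, from = "latin1", to = "UTF-8")

print(which(bad))
print(validUTF8(fixed_names))
print(nchar(fixed_names))
```

If the same assumption holds for the file itself, passing fileEncoding = "latin1" to read.csv() would be the analogous fix at load time.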

# Calculate mean and total profits by company category
cat_profits <- aggregate(profits ~ category, data=forbes, FUN=function(x) c(Mean = mean(x, na.rm=TRUE), Total = sum(x, na.rm=TRUE)))
print(cat_profits)

                           category profits.Mean profits.Total
1               Aerospace & defense    0.2884211     5.4800000
2                           Banking    0.4220767   132.1100000
3      Business services & supplies    0.1707143    11.9500000
4                     Capital goods    0.0954717     5.0600000
5                         Chemicals    0.2606000    13.0300000
6                     Conglomerates    1.0145161    31.4500000
7                      Construction    0.1981013    15.6500000
8                 Consumer durables    0.5663514    41.9100000
9            Diversified financials    0.4995570    78.9300000
10            Drugs & biotechnology    1.4477778    65.1500000
11             Food drink & tobacco    0.5938554    49.2900000
12                     Food markets    0.2490909     8.2200000
13 Health care equipment & services    0.3609231    23.4600000
14     Hotels restaurants & leisure    0.2586486     9.5700000
15    Household & personal products    0.5497727    24.1900000
16                        Insurance    0.3430000    37.7300000
17                        Materials    0.1959794    19.0100000
18                            Media    0.2106557    12.8500000
19             Oil & gas operations    1.3055556   117.5000000
20                        Retailing    0.4759091    41.8800000
21                   Semiconductors    0.4365385    11.3500000
22              Software & services    0.5677419    17.6000000
23  Technology hardware & equipment    0.2055932    12.1300000
24      Telecommunications services   -0.9080303   -59.9300000
25                Trading companies    0.0280000     0.7000000
26                   Transportation    0.1388462    10.8300000
27                        Utilities    0.2114545    23.2600000

# Aggregate total profits and sales by country
country_summary <- aggregate(cbind(profits, sales) ~ country, data=forbes, FUN=sum, na.rm=TRUE)

# Sort to find the highest countries
head(country_summary[order(-country_summary$profits), ], 5)

          country profits   sales
60  United States  487.40 7540.27
9          Canada   23.30  360.06
56 United Kingdom   21.72 1425.30
2       Australia   18.08  188.65
49    South Korea   15.60  358.62

head(country_summary[order(-country_summary$sales), ], 5)

          country profits   sales
60  United States  487.40 7540.27
28          Japan    7.07 3220.24
56 United Kingdom   21.72 1425.30
18        Germany   -2.48 1350.79
16         France    7.37 1266.43

Profitability varies considerably across company categories. Drugs & Biotechnology (mean 1.45) and Oil & Gas Operations (mean 1.31) have the highest average profits, followed by Conglomerates (1.01). Telecommunications Services has the lowest average profit (-0.91) and is the only category whose mean and total profits are negative. In terms of total profits, Banking ranks first (132.11), ahead of Oil & Gas Operations (117.50).
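When aggregate() is given a function that returns a named vector, the result holds those values in a single matrix column, which makes sorting by the mean less obvious. The sketch below uses a small made-up data frame (toy, with hypothetical categories A, B and C) to show one way to flatten the result and rank categories by mean profit.

```r
# Made-up data for illustration only
toy <- data.frame(
  category = c("A", "A", "B", "B", "C"),
  profits  = c(1, 3, -2, 0, 5)
)

# Same pattern as the report: one call returning both mean and total
res <- aggregate(profits ~ category, data = toy,
                 FUN = function(x) c(Mean = mean(x), Total = sum(x)))

# 'profits' is a two-column matrix here, so the data frame has only 2 columns
print(ncol(res))

# Flatten the matrix column into ordinary columns, then sort by mean profit
flat   <- do.call(data.frame, res)
ranked <- flat[order(-flat$profits.Mean), ]
print(ranked)
```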

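By country, the United States generates the greatest total profits (487.40), followed by Canada (23.30) and the United Kingdom (21.72). The United States also records the highest total sales (7540.27), followed by Japan (3220.24) and the United Kingdom (1425.30). High sales do not guarantee high profits: Japan ranks second in sales but totals only 7.07 in profits, and Germany's total profits are negative (-2.48).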

Overall, profits and sales are concentrated in a few industries and countries, with the United States dominating both measures.
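One caveat about the country totals: the formula interface of aggregate() drops every row with a missing value in any variable of the formula before summing, so a company with missing profits does not contribute to the sales total either. The sketch below illustrates the difference on a made-up data frame (toy is hypothetical); whether this changes the country ranking in the real data was not checked here.

```r
# Made-up data: country A has one company with missing profits
toy <- data.frame(
  country = c("A", "A", "B"),
  profits = c(1, NA, 2),
  sales   = c(10, 20, 30)
)

# Formula interface: the row with NA profits is removed entirely,
# so its sales (20) are not counted
by_formula <- aggregate(cbind(profits, sales) ~ country, data = toy,
                        FUN = sum, na.rm = TRUE)

# Data-frame interface: each column is summed separately with na.rm = TRUE,
# so the sales of that row are kept
by_column <- aggregate(toy[c("profits", "sales")],
                       by = list(country = toy$country),
                       FUN = sum, na.rm = TRUE)

print(by_formula)
print(by_column)
```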

2. Country Comparison: USA vs Japan

# Subset data for USA and Japan
usa_data <- subset(forbes, country == "United States")
japan_data <- subset(forbes, country == "Japan")

# 1. Compare the Best (Minimum) Rank
min(usa_data$rank)

[1] 1

min(japan_data$rank)

[1] 8

# 2. Compare the Average (Mean) Rank
mean(usa_data$rank)

[1] 947.2756

mean(japan_data$rank)

[1] 1144.329

# Frequency of company categories for USA and Japan
head(sort(table(usa_data$category), decreasing=TRUE), 5)


                         Banking           Diversified financials 
                              83                               60 
                       Utilities Health care equipment & services 
                              54                               53 
                       Retailing 
                              53

head(sort(table(japan_data$category), decreasing=TRUE), 5)


               Banking Diversified financials      Consumer durables 
                    69                     24                     22 
        Transportation          Capital goods 
                    20                     19

The United States has the highest-ranked company in the dataset (rank 1), while Japan’s highest-ranked company is ranked 8th. U.S. companies also have a better average rank (947.28) than Japanese companies (1144.33); because lower rank values represent higher positions, U.S. companies sit higher on the list on average.

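In both countries, Banking is the most common company category and Diversified Financials is the second most common. The remaining top-five categories differ: the United States list is rounded out by Utilities, Health Care Equipment & Services, and Retailing, whereas Japan's is rounded out by Consumer Durables, Transportation, and Capital Goods. Raw counts are not directly comparable because the United States has 751 companies in the dataset and Japan has 316. As a share of each country's companies, Banking is actually more prominent in Japan (69 of 316, about 22%) than in the United States (83 of 751, about 11%). These results suggest that the U.S. entries lean toward utilities, health care, and retail, while Japan has stronger representation in manufacturing and industrial sectors.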
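Because the two countries contribute very different numbers of companies, shares are easier to compare than counts. The sketch below converts the top-five counts printed above into percentages. The country totals (751 and 316) are derived from the regression output later in the report (residual degrees of freedom plus four coefficients, plus the observations deleted for missing profits), so they are an inference rather than a directly printed figure.

```r
# Country totals inferred from the regression summaries
n_usa   <- 744 + 4 + 3   # residual df + coefficients + rows deleted for NA
n_japan <- 312 + 4       # residual df + coefficients, no rows deleted

# Top-five category counts as printed by table()
top_usa <- c("Banking" = 83, "Diversified financials" = 60, "Utilities" = 54,
             "Health care equipment & services" = 53, "Retailing" = 53)
top_japan <- c("Banking" = 69, "Diversified financials" = 24,
               "Consumer durables" = 22, "Transportation" = 20,
               "Capital goods" = 19)

# Percentage of each country's companies in each category
share_usa   <- round(100 * top_usa / n_usa, 1)
share_japan <- round(100 * top_japan / n_japan, 1)

print(share_usa)
print(share_japan)
```

With the full data, prop.table(table(usa_data$category)) would give the same shares for every category at once.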

3. Multiple Linear Regression Model: All Companies

# Build the global multiple linear regression model
model_all <- lm(profits ~ assets + marketvalue + sales, data=forbes)

# Summary for coefficients and R-squared (goodness-of-fit)
summary(model_all)


Call:
lm(formula = profits ~ assets + marketvalue + sales, data = forbes)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.2169  -0.0189   0.1160   0.2107   8.9495 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1186259  0.0380039  -3.121  0.00183 ** 
assets      -0.0008395  0.0003781  -2.220  0.02651 *  
marketvalue  0.0363340  0.0018183  19.982  < 2e-16 ***
sales        0.0098892  0.0024331   4.064    5e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.472 on 1991 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.3059,    Adjusted R-squared:  0.3049 
F-statistic: 292.5 on 3 and 1991 DF,  p-value: < 2.2e-16

# ANOVA for sequential effects
anova(model_all)

Analysis of Variance Table

Response: profits
              Df Sum Sq Mean Sq F value    Pr(>F)    
assets         1  312.8  312.84  144.40 < 2.2e-16 ***
marketvalue    1 1552.8 1552.77  716.71 < 2.2e-16 ***
sales          1   35.8   35.79   16.52 5.001e-05 ***
Residuals   1991 4313.6    2.17                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Residuals diagnostics
hist(residuals(model_all), main="Histogram of Residuals (All Companies)", xlab="Residuals", col="salmon")

plot(model_all, which=1)

Model Summary & Goodness-of-Fit: The multiple linear regression model explains approximately 30.6% of the variation in corporate profits (R² = 0.3059). Market value (β = 0.0363, p < 0.001) and sales (β = 0.0099, p < 0.001) have significant positive effects on profits, while assets have a small but statistically significant negative effect (β = -0.00084, p = 0.0265).

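Variable with the Greatest Effect: Market value has the greatest effect on profits. In the ANOVA table it has the largest Sum of Squares (1552.77) and F-statistic (716.71). Because anova() reports sequential sums of squares, these values depend on the order in which the terms enter the model. The coefficient table, which does not depend on that order, agrees: market value has by far the largest t value (19.98). This suggests that market value is the strongest predictor of profitability among the variables considered.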
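The order dependence of sequential sums of squares can be seen on simulated data. The sketch below is illustrative only: x1, x2 and y are made-up correlated variables, not the Forbes data. It fits the same model with the terms in both orders and compares the result with drop1(), which tests each term as if it were entered last.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- x1 + x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

fit_12 <- lm(y ~ x1 + x2, data = d)
fit_21 <- lm(y ~ x2 + x1, data = d)

# Sequential (Type I) sums of squares for x1, entered first vs. entered last
ss_x1_first <- anova(fit_12)["x1", "Sum Sq"]
ss_x1_last  <- anova(fit_21)["x1", "Sum Sq"]

# drop1() gives the sum of squares for each term given all the others
ss_x1_drop1 <- drop1(fit_12, test = "F")["x1", "Sum of Sq"]

print(c(first = ss_x1_first, last = ss_x1_last, drop1 = ss_x1_drop1))
```

Applied to the report's model, drop1(model_all, test = "F") would give an order-independent ranking of the three predictors.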

Residual Diagnostics: The residuals are centered near zero (median 0.116), but the distribution has heavy tails: the middle half of the residuals lies between -0.019 and 0.211, while the extremes reach -29.22 and 8.95. The Residuals vs Fitted plot does not indicate severe departures from linearity, but a few outliers and some variation in residual spread are visible. The normality and constant-variance assumptions are therefore only approximately satisfied, and the fitted coefficients may be influenced by a small number of extreme observations.
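A rough way to quantify how extreme the tails are is to compare the largest residuals with the usual boxplot fences (1.5 times the interquartile range beyond the quartiles). The sketch below uses only the five residual quantiles printed by summary(model_all); the choice of the 1.5 × IQR rule is an illustrative convention, not part of the original analysis.

```r
# Residual quantiles copied from the model summary
res_q <- c(min = -29.2169, q1 = -0.0189, median = 0.1160,
           q3 = 0.2107, max = 8.9495)

# Interquartile range and conventional boxplot fences
iqr         <- res_q[["q3"]] - res_q[["q1"]]
lower_fence <- res_q[["q1"]] - 1.5 * iqr
upper_fence <- res_q[["q3"]] + 1.5 * iqr

# How many IQRs the extremes lie from the nearest quartile
iqrs_below <- (res_q[["q1"]] - res_q[["min"]]) / iqr
iqrs_above <- (res_q[["max"]] - res_q[["q3"]]) / iqr

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(c(iqrs_below = iqrs_below, iqrs_above = iqrs_above))
```

With the fitted model available, plot(model_all, which = 2) would show the same tail behavior as a normal Q-Q plot.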

4. Regional Comparison Models: USA vs Japan

# Build regression model for American companies
model_usa <- lm(profits ~ assets + marketvalue + sales, data=usa_data)
summary(model_usa)


Call:
lm(formula = profits ~ assets + marketvalue + sales, data = usa_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4954 -0.0620  0.1217  0.1953  7.8811 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1465337  0.0336610  -4.353 1.53e-05 ***
assets       0.0043831  0.0003655  11.993  < 2e-16 ***
marketvalue  0.0339639  0.0012747  26.645  < 2e-16 ***
sales        0.0138408  0.0020376   6.793 2.25e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8118 on 744 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.795, Adjusted R-squared:  0.7942 
F-statistic: 961.8 on 3 and 744 DF,  p-value: < 2.2e-16

anova(model_usa)

Analysis of Variance Table

Response: profits
             Df Sum Sq Mean Sq  F value    Pr(>F)    
assets        1 975.23  975.23 1479.783 < 2.2e-16 ***
marketvalue   1 895.96  895.96 1359.498 < 2.2e-16 ***
sales         1  30.41   30.41   46.142 2.253e-11 ***
Residuals   744 490.32    0.66                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Build regression model for Japanese companies
model_japan <- lm(profits ~ assets + marketvalue + sales, data=japan_data)
summary(model_japan)


Call:
lm(formula = profits ~ assets + marketvalue + sales, data = japan_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2316 -0.1651  0.0184  0.2160  5.3909 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0731901  0.0551181  -1.328    0.185    
assets      -0.0126813  0.0004999 -25.368   <2e-16 ***
marketvalue  0.0765111  0.0062517  12.239   <2e-16 ***
sales       -0.0006585  0.0034725  -0.190    0.850    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8108 on 312 degrees of freedom
Multiple R-squared:  0.6857,    Adjusted R-squared:  0.6827 
F-statistic: 226.9 on 3 and 312 DF,  p-value: < 2.2e-16

anova(model_japan)

Analysis of Variance Table

Response: profits
             Df  Sum Sq Mean Sq F value Pr(>F)    
assets        1 284.846 284.846 433.311 <2e-16 ***
marketvalue   1 162.574 162.574 247.310 <2e-16 ***
sales         1   0.024   0.024   0.036 0.8497    
Residuals   312 205.100   0.657                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

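Both models indicate that assets and market value are the most important predictors of profits. In the sequential ANOVA tables, assets, which are entered first, have the largest F-statistic in both the United States (F = 1479.78) and Japan (F = 433.31), although this ranking depends on the order of the terms. In the coefficient tables, market value has the largest t value in the U.S. model (26.65) and assets in the Japanese model (-25.37). Two differences stand out. First, the assets coefficient is positive in the U.S. model (0.0044) but negative in the Japanese model (-0.0127). Second, sales significantly affect profits in the U.S. model (p < 0.001) but not in the Japanese model (p = 0.85). The U.S. model also has a higher R² (0.795) than the Japanese model (0.686), indicating a better in-sample fit.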
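The differences between the two sets of coefficients can be checked informally with an approximate z-test on each pair of estimates. The sketch below uses the estimates and standard errors printed in the two summaries and assumes the two country samples are independent. It is a rough comparison, not a substitute for fitting a single model with country interaction terms.

```r
# Estimates and standard errors copied from summary(model_usa) and summary(model_japan)
coefs <- data.frame(
  term      = c("assets", "marketvalue", "sales"),
  est_usa   = c(0.0043831, 0.0339639, 0.0138408),
  se_usa    = c(0.0003655, 0.0012747, 0.0020376),
  est_japan = c(-0.0126813, 0.0765111, -0.0006585),
  se_japan  = c(0.0004999, 0.0062517, 0.0034725)
)

# Approximate z statistic for the USA minus Japan difference in each coefficient
coefs$diff <- coefs$est_usa - coefs$est_japan
coefs$z    <- coefs$diff / sqrt(coefs$se_usa^2 + coefs$se_japan^2)

# Two-sided p-value from the normal approximation
coefs$p <- 2 * pnorm(-abs(coefs$z))

print(coefs[, c("term", "diff", "z", "p")])
```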