1 Background

In this analysis, we use data from Kagle(Kagle, 2023) to illustrate the use of Python and Pandas in data analysis(VanderPlas, 2016). The data captures the top 2000 companies in the world and is available for free upon registration on the Kagle website.

2 Objectives

The primary goal of this exercise is to illustrate how to use Python, pandas and other libraries in Python programming to analyse data for meaningful insights.

Specifically, we seek to;

Rank countries and continents based on the concentration of large companies.
Compute the correlation between company size (in terms of assets) and financial performance.
Uncover regional patterns in company performance based on continent.
Establish the companies, continents with the highest returns on assets and profit margins.

3 Summary of Results

Most large companies in the sample are located in the United States.
The continent with the highest number of large companies is Asia.
There is a positive relationship between company size and financial performance (sales, profits, market value). However, this relationship varies across continents.
Companies from North America and Asia have the highest sales and profits.
Companies from Europe and Asia have the highest asset base.
Companies in Africa and North America are more efficient in converting sales and assets into profits.

Tip

Please visit my [rpubs site](www.rpubs.com/Karuitha) to see more data projects. Alternatively, copy and paste the link <www.rpubs.com/Karuitha> into your browser. You can also view my [linkedin](www.linkedin.com/in/Karuitha) site for my skills and education.

My [Tableau public profile](https://public.tableau.com/app/profile/john.karuitha) contains my data visualizations.

My Shiny web apps are available on this [site](https://karuitha.shinyapps.io/). You can copy-paste this web address instead https://karuitha.shinyapps.io/

Skills & Technologies Applied: Python, Pandas, Geopandas, Matplotlib, Quarto, Data Science.

4 Data

The data is available on kagle as a csv file (registration is required to access the data). I start by loading the necessary packages for the analysis: pandas, numpy, matplotlib, and seaborn (Borjigin, 2023).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point

Next I read in the data using pandas and set the index column as the name of the company given it is a unique and meaningful identifier.

```{python}
company = pd.read_csv("Top2000CompaniesGlobally.csv", 
index_col = "Company")

company.columns
```

Index(['Global Rank', 'Sales ($billion)', 'Profits ($billion)',
       'Assets ($billion)', 'Market Value ($billion)', 'Country', 'Continent',
       'Latitude', 'Longitude'],
      dtype='object')

Next, I rename the columns using names that are easier to work with. I create a dictionary of old names versus new names and then rename the columns.

```{python}
## Create a dictionary of old names vs new names
names = {'Global Rank':'rank', 
'Sales ($billion)':'sales', 
'Profits ($billion)':'profits',
'Assets ($billion)':'assets', 
'Market Value ($billion)':'market_value', 
'Country':'country', 
'Continent':'continent',
'Latitude':'latitude', 
'Longitude':'longitude'}

# Rename the columns in place
company.rename(columns = names, inplace = True)
```

5 Data Exploration

We now explore the data.

5.1 Number of rows and columns

We start with the number of rows and columns in the data.

```{python}
company.shape
```

(1924, 9)

We see that the data has 1924 rows (records) and 9 columns (variables).

5.2 First and Last Rows of the Data

Next, lets examine the first 5 rows.

```{python}
company.head(5)
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company
ICBC	1	134.8	37.8	2813.5	237.3	China	Asia	35.86166	104.195397
China Construction Bank	2	113.1	30.6	2241.0	202.0	China	Asia	35.86166	104.195397
JPMorgan Chase	3	108.2	21.3	2359.1	191.4	USA	North America	37.09024	-95.712891
General Electric	4	147.4	13.6	685.3	243.7	USA	North America	37.09024	-95.712891
Exxon Mobil	5	420.7	44.9	333.8	400.4	USA	North America	37.09024	-95.712891

We also check the last 7 rows.

```{python}
company.tail(7)
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company
Health Net	1993	11.3	0.1	3.9	2.3	USA	North America	37.090240	-95.712891
Bank of Saga	1994	0.5	0.0	25.0	0.4	Japan	Asia	36.204824	138.252924
Tractor Supply	1995	4.7	0.3	1.7	7.1	USA	North America	37.090240	-95.712891
San-Ai Oil	1996	0.5	0.1	25.7	0.5	Japan	Asia	36.204824	138.252924
UOL Group	1996	0.9	0.7	7.8	4.2	Singapore	Asia	1.352083	103.819836
Interconexion Electrica	1998	2.4	0.2	14.6	5.8	Colombia	South America	4.570868	-74.297333
Old Republic International	1999	5.0	-0.1	16.2	3.3	USA	North America	37.090240	-95.712891

5.3 Visual Summary

let us use seaborn pairplot() function to create a summary picture of the data using all variables, colored by country.

```{python}
#| fig-cap: "Pairs plots for the Variables (A)"
sns.pairplot(company[['country', 'continent', 'sales', 'profits', 'assets', 'market_value']], hue = "continent", corner = True, palette = "Dark2", kind = "scatter")
```

```{python}
#| fig-cap: "Pairs plots for the Variables (B)"
sns.pairplot(company[['country', 'continent', 'sales', 'profits', 'assets', 'market_value']], hue = "continent", corner = True, palette = "Dark2", kind = "kde")
```

5.4 Variable Types

The variable types are also an important data component.

```{python}
company.info()
```

<class 'pandas.core.frame.DataFrame'>
Index: 1924 entries, ICBC to Old Republic International
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rank          1924 non-null   int64  
 1   sales         1924 non-null   float64
 2   profits       1924 non-null   float64
 3   assets        1924 non-null   float64
 4   market_value  1924 non-null   float64
 5   country       1924 non-null   object 
 6   continent     1924 non-null   object 
 7   latitude      1924 non-null   float64
 8   longitude     1924 non-null   float64
dtypes: float64(6), int64(1), object(2)
memory usage: 214.9+ KB

```{python}
company.dtypes
```

rank              int64
sales           float64
profits         float64
assets          float64
market_value    float64
country          object
continent        object
latitude        float64
longitude       float64
dtype: object

5.5 Missing Values

Let us check the data for any missing values for sales, profits, assets, market_value, longitude, and latitude.

```{python}
company[company['sales'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

```{python}
company[company['profits'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

```{python}
company[company['assets'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

```{python}
company[company['market_value'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

```{python}
company[company['longitude'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

```{python}
company[company['latitude'].isna()]
```

	rank	sales	profits	assets	market_value	country	continent	latitude	longitude
Company

This is a conveniently clean data without missing values which is a rare occurrence in data science.

5.6 Summary Statistics

We use the describe method to make summary statistics for numeric and character variables separately.

```{python}
company.describe(include = "float64")
```

	sales	profits	assets	market_value	latitude	longitude
count	1924.000000	1924.000000	1924.000000	1924.000000	1924.000000	1924.000000
mean	19.265904	1.226040	79.507796	19.558160	34.618747	15.455664
std	34.683911	3.413831	261.098775	32.957023	18.259499	92.639655
min	0.000000	-24.500000	1.000000	0.000000	-40.900557	-106.346771
25%	4.100000	0.300000	9.675000	5.300000	35.861660	-95.712891
50%	9.000000	0.600000	19.250000	9.600000	37.090240	10.451526
75%	18.425000	1.200000	45.800000	19.200000	40.463667	105.318756
max	469.200000	44.900000	3226.200000	416.600000	61.924110	174.885971

We also do the same for string data types.

```{python}
company.describe(include = "object")
```

	country	continent
count	1924	1924
unique	60	6
top	USA	Asia
freq	535	719

Here we see that there are 1924 observations with 60 unique countries across 6 continents. The country with the most companies is the USA (535), while the continent with the most companies is Asia (719).

6 Analysis

In this section, we explore some key issues from the data, listed in order.

6.1 Which Countries and Continents Have the Highest Concentration of Large Companies?

In this section, we utilize the global rank in size variable to identify the countries and continents with the highest number of top-ranking companies. We see that companies from USA, Japan, China, UK, Canada, and South Korea dominate the rankings. Note that these are countries mainly from North America and Asia (with the exception of the UK) which illustrates the dominance of these two continents in global trade.

```{python}
company['country'].value_counts()
```

country
USA                     535
Japan                   246
China                   135
UK                       97
Canada                   65
South Korea              64
France                   64
India                    56
Germany                  50
Hong Kong                46
Australia                42
Taiwan                   40
Brazil                   31
Russia                   30
Italy                    29
Spain                    28
Sweden                   26
The Netherlands          24
Singapore                20
Malaysia                 20
South Africa             19
Mexico                   19
Saudi Arabia             17
Ireland                  17
Thailand                 16
Turkey                   14
United Arab Emirates     14
Finland                  12
Belgium                  11
Norway                   10
Denmark                  10
Israel                   10
Greece                   10
Indonesia                 9
Chile                     9
Bermuda                   9
Philippines               8
Qatar                     8
Poland                    8
Luxembourg                7
Colombia                  6
Portugal                  6
Kuwait                    4
Hungary                   2
Peru                      2
Morocco                   2
Kazakhstan                2
Egypt                     2
Lebanon                   2
Pakistan                  1
Vietnam                   1
Channel Islands           1
Oman                      1
Nigeria                   1
Jordan                    1
Liberia                   1
Venezuela                 1
Panama                    1
Czech Republic            1
New Zealand               1
Name: count, dtype: int64

Just to confirm, let us see the concentration of companies by continent.

```{python}
company['continent'].value_counts()
```

continent
Asia             719
North America    629
Europe           459
South America     49
Oceania           43
Africa            25
Name: count, dtype: int64

We then visualize this information on a world map using latitude and longitude data to show the geographic distribution of these companies. The size of the dots represents profits. We find that while there are the profitable companies are mainly in Asia, Europe, and North America. The rest of the world generates very few profits.

```{python}
# Create a GeoDataFrame
geometry = [Point(lon, lat) for lon, lat in zip(company['longitude'], company['latitude'])]
gdf = gpd.GeoDataFrame(company, geometry=geometry, columns=['company', 'sales', 'profits', 'assets'])
```

```{python}
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
```

```{python}
#| fig-cap: "Top Companies by Sales on a World Map"
# Create a subplot
fig, ax = plt.subplots(figsize=(10, 10))

# Plot the world map
world.boundary.plot(ax=ax, color = "lightgrey")

# Plot your data on the map
gdf.plot(ax=ax, marker='o', color='red', markersize='profits', label='Companies')

# Add labels and legend
ax.set_title('Top Companies by Profits on World Map')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend()

plt.show()
```

6.2 Is There a Correlation Between Company Size (in terms of assets) and Financial Performance?

Here, we analyze the relationship between the asset base of companies, which is a proxy for size and financial metrics such as sales, profits, and market value. The goal is to determine if larger companies tend to have higher sales, profits, and market value, or if there are exceptions. Overall, there appears to be a positive link between the size of a company and its financial performance in terms of sales, profits, and market value. However, this relationship varies by continent.

```{python}
#| fig-cap: "Pairs plots for the Variables (C)"
sns.pairplot(company[["continent", "assets", "sales", "profits", "market_value"]], hue = "continent")
```

```{python}
company[["assets", "sales", "profits", "market_value"]].corr()
```

	assets	sales	profits	market_value
assets	1.000000	0.356069	0.356553	0.375544
sales	0.356069	1.000000	0.594174	0.649493
profits	0.356553	0.594174	1.000000	0.788853
market_value	0.375544	0.649493	0.788853	1.000000

6.3 Are There Regional Patterns in Company Performance Based on Continent?

In this section, I group the data by continent and analyze the average sales, profits, and assets for companies in each continent. North America and Asia have the highest sales and profits. In terms of asset base, companies in Europe and North America lead. This implies that companies in North America are more efficient in converting their sales and assets into profits.

```{python}
#| fig-cap: "Sales by Companies"
company.groupby("continent")["sales"].sum().sort_values().plot(kind = "barh", title = "Total Sales of Top 2000 Companies globally by Continent")
```

<Axes: title={'center': 'Total Sales of Top 2000 Companies globally by Continent'}, ylabel='continent'>

Sales by Companies

```{python}
#| fig-cap: "Total Profits of Top 2000 Companies globally by Continent"
company.groupby("continent")["profits"].sum().sort_values().plot(kind = "barh", title = "Total Profits of Top 2000 Companies globally by Continent")
```

<Axes: title={'center': 'Total Profits of Top 2000 Companies globally by Continent'}, ylabel='continent'>

Total Profits of Top 2000 Companies globally by Continent

```{python}
#| fig-cap: "Total Assets of Top 2000 Companies globally by Continent"
company.groupby("continent")["assets"].sum().sort_values().plot(kind = "barh", title = "Total Assets of Top 2000 Companies globally by Continent")
```

<Axes: title={'center': 'Total Assets of Top 2000 Companies globally by Continent'}, ylabel='continent'>

Total Assets of Top 2000 Companies globally by Continent

6.4 Which Companies, Contiments have the Highest Returns on Assets and Profit margins?

The return on assets is the ratio of profits to assets and captures the efficiency with which a company utilises its assets to generate profits. The net profit margin also captures how well a company is able to convert its sales into profits. The margin can also proxy the management efforts towards expenses management.

Carphone Warehouse with a return on assets of 1 means that it generated profits equivalent to its assets base. This is phenomenal!! The companies in the top 5 all have a return greater than 30%, which is an exceptional performance.

```{python}
company["roa"] = company["profits"] / company["assets"]
company["margin"] = company["profits"] / company["sales"]
```

```{python}
company['roa'].sort_values(ascending = False).head()
```

Company
Carphone Warehouse           1.000000
Regeneron Pharmaceuticals    0.470588
Alexander's                  0.466667
Saudi Arabian Fertilizers    0.370370
Seagate Technology           0.367816
Name: roa, dtype: float64

Next, I compute the median ROA for each continent. Africa leads, though the continent has very few companies in the sample. North America is in second place, which implies that the companies in this continent are very efficient in utilizing assets to generate profits.

```{python}
company.groupby("continent")['roa'].median().sort_values(ascending = False)
```

continent
Africa           0.063492
North America    0.046512
Oceania          0.046512
Europe           0.034483
South America    0.024590
Asia             0.021583
Name: roa, dtype: float64

Let us do the same for profit margin. The companies in the top 5 have ratios greater than 1 which implies they generated profits greater than the sales. This is cause for concern especially for Carphone Warehouse which appears to have generated profits from zero sales.

```{python}
company['margin'].sort_values(ascending = False).head()
```

Company
Carphone Warehouse         inf
Hysan Development     4.333333
Industrivarden        4.250000
Alexander's           3.500000
Chinese Estates       3.250000
Name: margin, dtype: float64

Africa leads in median profit margin. Again this could be die to the low number of companies from the continent in this sample. Oceania and North America follow in that order. Europe does poorly.

```{python}
company.groupby("continent")['margin'].median().sort_values(ascending = False)
```

continent
Africa           0.139423
Oceania          0.107143
North America    0.090909
South America    0.083333
Asia             0.072289
Europe           0.060185
Name: margin, dtype: float64

7 Conclusion

In this analysis, we use data from Kagle to illustrate the use of Python and Pandas in data analysis. From the analysis, we find that;

Most large companies in the sample are located in the United States.
The continent with the highest number of large companies is Asia.
There is a positive relationship between company size and financial performance. However, this relationship varies across continents.
Companies from North America and Asia have the highest sales and profits.
Companies from Europe and Asia have the highest asset base.
Companies in Africa and North America are more efficient in converting sales and assets into profits.

References

Borjigin, C. (2023). Python data science. Springer Nature Singapore. https://doi.org/10.1007/978-981-19-7702-2

Kagle. (2023). Top 2000 companies globally. https://www.kaggle.com/datasets/joebeachcapital/top-2000-companies-globally.

VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. " O’Reilly Media, Inc.".