import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point
1 Background
In this analysis, we use data from Kagle(Kagle, 2023) to illustrate the use of Python and Pandas in data analysis(VanderPlas, 2016). The data captures the top 2000 companies in the world and is available for free upon registration on the Kagle website.
2 Objectives
The primary goal of this exercise is to illustrate how to use Python, pandas and other libraries in Python programming to analyse data for meaningful insights.
Specifically, we seek to;
- Rank countries and continents based on the concentration of large companies.
- Compute the correlation between company size (in terms of assets) and financial performance.
- Uncover regional patterns in company performance based on continent.
- Establish the companies, continents with the highest returns on assets and profit margins.
3 Summary of Results
- Most large companies in the sample are located in the United States.
- The continent with the highest number of large companies is Asia.
- There is a positive relationship between company size and financial performance (sales, profits, market value). However, this relationship varies across continents.
- Companies from North America and Asia have the highest sales and profits.
- Companies from Europe and Asia have the highest asset base.
- Companies in Africa and North America are more efficient in converting sales and assets into profits.
Please visit my [rpubs site](www.rpubs.com/Karuitha) to see more data projects. Alternatively, copy and paste the link <www.rpubs.com/Karuitha> into your browser. You can also view my [linkedin](www.linkedin.com/in/Karuitha) site for my skills and education.
My [Tableau public profile](https://public.tableau.com/app/profile/john.karuitha) contains my data visualizations.
My Shiny web apps are available on this [site](https://karuitha.shinyapps.io/). You can copy-paste this web address instead https://karuitha.shinyapps.io/
Skills & Technologies Applied: Python, Pandas, Geopandas, Matplotlib, Quarto, Data Science.
4 Data
The data is available on kagle as a csv file (registration is required to access the data). I start by loading the necessary packages for the analysis: pandas, numpy, matplotlib, and seaborn (Borjigin, 2023).
Next I read in the data using pandas and set the index column as the name of the company given it is a unique and meaningful identifier.
```{python}
= pd.read_csv("Top2000CompaniesGlobally.csv",
company = "Company")
index_col
company.columns```
Index(['Global Rank', 'Sales ($billion)', 'Profits ($billion)',
'Assets ($billion)', 'Market Value ($billion)', 'Country', 'Continent',
'Latitude', 'Longitude'],
dtype='object')
Next, I rename the columns using names that are easier to work with. I create a dictionary of old names versus new names and then rename the columns.
```{python}
## Create a dictionary of old names vs new names
= {'Global Rank':'rank',
names 'Sales ($billion)':'sales',
'Profits ($billion)':'profits',
'Assets ($billion)':'assets',
'Market Value ($billion)':'market_value',
'Country':'country',
'Continent':'continent',
'Latitude':'latitude',
'Longitude':'longitude'}
# Rename the columns in place
= names, inplace = True)
company.rename(columns ```
5 Data Exploration
We now explore the data.
5.1 Number of rows and columns
We start with the number of rows and columns in the data.
```{python}
company.shape```
(1924, 9)
We see that the data has 1924 rows (records) and 9 columns (variables).
5.2 First and Last Rows of the Data
Next, lets examine the first 5 rows.
```{python}
5)
company.head(```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company | |||||||||
ICBC | 1 | 134.8 | 37.8 | 2813.5 | 237.3 | China | Asia | 35.86166 | 104.195397 |
China Construction Bank | 2 | 113.1 | 30.6 | 2241.0 | 202.0 | China | Asia | 35.86166 | 104.195397 |
JPMorgan Chase | 3 | 108.2 | 21.3 | 2359.1 | 191.4 | USA | North America | 37.09024 | -95.712891 |
General Electric | 4 | 147.4 | 13.6 | 685.3 | 243.7 | USA | North America | 37.09024 | -95.712891 |
Exxon Mobil | 5 | 420.7 | 44.9 | 333.8 | 400.4 | USA | North America | 37.09024 | -95.712891 |
We also check the last 7 rows.
```{python}
7)
company.tail(```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company | |||||||||
Health Net | 1993 | 11.3 | 0.1 | 3.9 | 2.3 | USA | North America | 37.090240 | -95.712891 |
Bank of Saga | 1994 | 0.5 | 0.0 | 25.0 | 0.4 | Japan | Asia | 36.204824 | 138.252924 |
Tractor Supply | 1995 | 4.7 | 0.3 | 1.7 | 7.1 | USA | North America | 37.090240 | -95.712891 |
San-Ai Oil | 1996 | 0.5 | 0.1 | 25.7 | 0.5 | Japan | Asia | 36.204824 | 138.252924 |
UOL Group | 1996 | 0.9 | 0.7 | 7.8 | 4.2 | Singapore | Asia | 1.352083 | 103.819836 |
Interconexion Electrica | 1998 | 2.4 | 0.2 | 14.6 | 5.8 | Colombia | South America | 4.570868 | -74.297333 |
Old Republic International | 1999 | 5.0 | -0.1 | 16.2 | 3.3 | USA | North America | 37.090240 | -95.712891 |
5.3 Visual Summary
let us use seaborn pairplot() function to create a summary picture of the data using all variables, colored by country.
```{python}
#| fig-cap: "Pairs plots for the Variables (A)"
'country', 'continent', 'sales', 'profits', 'assets', 'market_value']], hue = "continent", corner = True, palette = "Dark2", kind = "scatter")
sns.pairplot(company[[```
```{python}
#| fig-cap: "Pairs plots for the Variables (B)"
'country', 'continent', 'sales', 'profits', 'assets', 'market_value']], hue = "continent", corner = True, palette = "Dark2", kind = "kde")
sns.pairplot(company[[```
5.4 Variable Types
The variable types are also an important data component.
```{python}
company.info()```
<class 'pandas.core.frame.DataFrame'>
Index: 1924 entries, ICBC to Old Republic International
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rank 1924 non-null int64
1 sales 1924 non-null float64
2 profits 1924 non-null float64
3 assets 1924 non-null float64
4 market_value 1924 non-null float64
5 country 1924 non-null object
6 continent 1924 non-null object
7 latitude 1924 non-null float64
8 longitude 1924 non-null float64
dtypes: float64(6), int64(1), object(2)
memory usage: 214.9+ KB
```{python}
company.dtypes```
rank int64
sales float64
profits float64
assets float64
market_value float64
country object
continent object
latitude float64
longitude float64
dtype: object
5.5 Missing Values
Let us check the data for any missing values for sales, profits, assets, market_value, longitude, and latitude.
```{python}
'sales'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
```{python}
'profits'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
```{python}
'assets'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
```{python}
'market_value'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
```{python}
'longitude'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
```{python}
'latitude'].isna()]
company[company[```
rank | sales | profits | assets | market_value | country | continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|
Company |
This is a conveniently clean data without missing values which is a rare occurrence in data science.
5.6 Summary Statistics
We use the describe method to make summary statistics for numeric and character variables separately.
```{python}
= "float64")
company.describe(include ```
sales | profits | assets | market_value | latitude | longitude | |
---|---|---|---|---|---|---|
count | 1924.000000 | 1924.000000 | 1924.000000 | 1924.000000 | 1924.000000 | 1924.000000 |
mean | 19.265904 | 1.226040 | 79.507796 | 19.558160 | 34.618747 | 15.455664 |
std | 34.683911 | 3.413831 | 261.098775 | 32.957023 | 18.259499 | 92.639655 |
min | 0.000000 | -24.500000 | 1.000000 | 0.000000 | -40.900557 | -106.346771 |
25% | 4.100000 | 0.300000 | 9.675000 | 5.300000 | 35.861660 | -95.712891 |
50% | 9.000000 | 0.600000 | 19.250000 | 9.600000 | 37.090240 | 10.451526 |
75% | 18.425000 | 1.200000 | 45.800000 | 19.200000 | 40.463667 | 105.318756 |
max | 469.200000 | 44.900000 | 3226.200000 | 416.600000 | 61.924110 | 174.885971 |
We also do the same for string data types.
```{python}
= "object")
company.describe(include ```
country | continent | |
---|---|---|
count | 1924 | 1924 |
unique | 60 | 6 |
top | USA | Asia |
freq | 535 | 719 |
Here we see that there are 1924 observations with 60 unique countries across 6 continents. The country with the most companies is the USA (535), while the continent with the most companies is Asia (719).
6 Analysis
In this section, we explore some key issues from the data, listed in order.
6.1 Which Countries and Continents Have the Highest Concentration of Large Companies?
In this section, we utilize the global rank in size variable to identify the countries and continents with the highest number of top-ranking companies. We see that companies from USA, Japan, China, UK, Canada, and South Korea dominate the rankings. Note that these are countries mainly from North America and Asia (with the exception of the UK) which illustrates the dominance of these two continents in global trade.
```{python}
'country'].value_counts()
company[```
country
USA 535
Japan 246
China 135
UK 97
Canada 65
South Korea 64
France 64
India 56
Germany 50
Hong Kong 46
Australia 42
Taiwan 40
Brazil 31
Russia 30
Italy 29
Spain 28
Sweden 26
The Netherlands 24
Singapore 20
Malaysia 20
South Africa 19
Mexico 19
Saudi Arabia 17
Ireland 17
Thailand 16
Turkey 14
United Arab Emirates 14
Finland 12
Belgium 11
Norway 10
Denmark 10
Israel 10
Greece 10
Indonesia 9
Chile 9
Bermuda 9
Philippines 8
Qatar 8
Poland 8
Luxembourg 7
Colombia 6
Portugal 6
Kuwait 4
Hungary 2
Peru 2
Morocco 2
Kazakhstan 2
Egypt 2
Lebanon 2
Pakistan 1
Vietnam 1
Channel Islands 1
Oman 1
Nigeria 1
Jordan 1
Liberia 1
Venezuela 1
Panama 1
Czech Republic 1
New Zealand 1
Name: count, dtype: int64
Just to confirm, let us see the concentration of companies by continent.
```{python}
'continent'].value_counts()
company[```
continent
Asia 719
North America 629
Europe 459
South America 49
Oceania 43
Africa 25
Name: count, dtype: int64
We then visualize this information on a world map using latitude and longitude data to show the geographic distribution of these companies. The size of the dots represents profits. We find that while there are the profitable companies are mainly in Asia, Europe, and North America. The rest of the world generates very few profits.
```{python}
# Create a GeoDataFrame
= [Point(lon, lat) for lon, lat in zip(company['longitude'], company['latitude'])]
geometry = gpd.GeoDataFrame(company, geometry=geometry, columns=['company', 'sales', 'profits', 'assets'])
gdf ```
```{python}
= gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world ```
```{python}
#| fig-cap: "Top Companies by Sales on a World Map"
# Create a subplot
= plt.subplots(figsize=(10, 10))
fig, ax
# Plot the world map
=ax, color = "lightgrey")
world.boundary.plot(ax
# Plot your data on the map
=ax, marker='o', color='red', markersize='profits', label='Companies')
gdf.plot(ax
# Add labels and legend
'Top Companies by Profits on World Map')
ax.set_title('Longitude')
ax.set_xlabel('Latitude')
ax.set_ylabel(
ax.legend()
plt.show()```
6.2 Is There a Correlation Between Company Size (in terms of assets) and Financial Performance?
Here, we analyze the relationship between the asset base of companies, which is a proxy for size and financial metrics such as sales, profits, and market value. The goal is to determine if larger companies tend to have higher sales, profits, and market value, or if there are exceptions. Overall, there appears to be a positive link between the size of a company and its financial performance in terms of sales, profits, and market value. However, this relationship varies by continent.
```{python}
#| fig-cap: "Pairs plots for the Variables (C)"
"continent", "assets", "sales", "profits", "market_value"]], hue = "continent")
sns.pairplot(company[[```
```{python}
"assets", "sales", "profits", "market_value"]].corr()
company[[```
assets | sales | profits | market_value | |
---|---|---|---|---|
assets | 1.000000 | 0.356069 | 0.356553 | 0.375544 |
sales | 0.356069 | 1.000000 | 0.594174 | 0.649493 |
profits | 0.356553 | 0.594174 | 1.000000 | 0.788853 |
market_value | 0.375544 | 0.649493 | 0.788853 | 1.000000 |
6.3 Are There Regional Patterns in Company Performance Based on Continent?
In this section, I group the data by continent and analyze the average sales, profits, and assets for companies in each continent. North America and Asia have the highest sales and profits. In terms of asset base, companies in Europe and North America lead. This implies that companies in North America are more efficient in converting their sales and assets into profits.
```{python}
#| fig-cap: "Sales by Companies"
"continent")["sales"].sum().sort_values().plot(kind = "barh", title = "Total Sales of Top 2000 Companies globally by Continent")
company.groupby(```
<Axes: title={'center': 'Total Sales of Top 2000 Companies globally by Continent'}, ylabel='continent'>
Sales by Companies
```{python}
#| fig-cap: "Total Profits of Top 2000 Companies globally by Continent"
"continent")["profits"].sum().sort_values().plot(kind = "barh", title = "Total Profits of Top 2000 Companies globally by Continent")
company.groupby(```
<Axes: title={'center': 'Total Profits of Top 2000 Companies globally by Continent'}, ylabel='continent'>
Total Profits of Top 2000 Companies globally by Continent
```{python}
#| fig-cap: "Total Assets of Top 2000 Companies globally by Continent"
"continent")["assets"].sum().sort_values().plot(kind = "barh", title = "Total Assets of Top 2000 Companies globally by Continent")
company.groupby(```
<Axes: title={'center': 'Total Assets of Top 2000 Companies globally by Continent'}, ylabel='continent'>
Total Assets of Top 2000 Companies globally by Continent
6.4 Which Companies, Contiments have the Highest Returns on Assets and Profit margins?
The return on assets is the ratio of profits to assets and captures the efficiency with which a company utilises its assets to generate profits. The net profit margin also captures how well a company is able to convert its sales into profits. The margin can also proxy the management efforts towards expenses management.
Carphone Warehouse with a return on assets of 1 means that it generated profits equivalent to its assets base. This is phenomenal!! The companies in the top 5 all have a return greater than 30%, which is an exceptional performance.
```{python}
"roa"] = company["profits"] / company["assets"]
company["margin"] = company["profits"] / company["sales"]
company[```
```{python}
'roa'].sort_values(ascending = False).head()
company[```
Company
Carphone Warehouse 1.000000
Regeneron Pharmaceuticals 0.470588
Alexander's 0.466667
Saudi Arabian Fertilizers 0.370370
Seagate Technology 0.367816
Name: roa, dtype: float64
Next, I compute the median ROA for each continent. Africa leads, though the continent has very few companies in the sample. North America is in second place, which implies that the companies in this continent are very efficient in utilizing assets to generate profits.
```{python}
"continent")['roa'].median().sort_values(ascending = False)
company.groupby(```
continent
Africa 0.063492
North America 0.046512
Oceania 0.046512
Europe 0.034483
South America 0.024590
Asia 0.021583
Name: roa, dtype: float64
Let us do the same for profit margin. The companies in the top 5 have ratios greater than 1 which implies they generated profits greater than the sales. This is cause for concern especially for Carphone Warehouse which appears to have generated profits from zero sales.
```{python}
'margin'].sort_values(ascending = False).head()
company[```
Company
Carphone Warehouse inf
Hysan Development 4.333333
Industrivarden 4.250000
Alexander's 3.500000
Chinese Estates 3.250000
Name: margin, dtype: float64
Africa leads in median profit margin. Again this could be die to the low number of companies from the continent in this sample. Oceania and North America follow in that order. Europe does poorly.
```{python}
"continent")['margin'].median().sort_values(ascending = False)
company.groupby(```
continent
Africa 0.139423
Oceania 0.107143
North America 0.090909
South America 0.083333
Asia 0.072289
Europe 0.060185
Name: margin, dtype: float64
7 Conclusion
In this analysis, we use data from Kagle to illustrate the use of Python and Pandas in data analysis. From the analysis, we find that;
- Most large companies in the sample are located in the United States.
- The continent with the highest number of large companies is Asia.
- There is a positive relationship between company size and financial performance. However, this relationship varies across continents.
- Companies from North America and Asia have the highest sales and profits.
- Companies from Europe and Asia have the highest asset base.
- Companies in Africa and North America are more efficient in converting sales and assets into profits.