library(tidyverse)
library(dplyr)
library(readr)
library(readxl)
library(data.table)
library(DT)
library(dataMaid)
library(PerformanceAnalytics)
library(ggplot2)
library(ggstatsplot)
library(ggpubr)
library(sjPlot)
library(sjstats)
library(gridExtra)
library(knitr)
library(broom)
library(lattice)
library(gtsummary)
library(maps)
library(mapproj)
companies=read_csv("~/Desktop/MSc in Business Analytics/Programming for Business Analytics/Neoma_group 2/Fortune_1000.csv")
The Fortune 1000 are the 1,000 largest American companies ranked by revenues, as compiled by the American business magazine Fortune. It only includes companies which are incorporated or authorized to do business in the United States, and for which revenues are publicly available. It contains U.S. company data for the year 2022 (updated to 2021). The dataset collects 18 variables (columns) for 1000 observations (rows).
dim(companies)
## [1] 1000 18
The observed variables are the following:
companies %>%
datatable(options = list(scrollX = TRUE))
In the following sections of the project, the dataset will be subject
to data preparation and cleaning in order to be appropriate to perform
an Exploratory Data Analysis (EDA) and to find relevant patterns among
the variables and to answer the following questions:
1. Investigation of the relationship between market capitalization and
profits
2. Do revenues depend on the number of employees?
3. Is there a pattern among sectors in rank changing? Are there sectors
that on average have gained/lost positions in the ranking?
4. Analysis of the pattern between revenues and profits. Are there
sectors that have a higher impact of costs? Profit margin analysis per
sector
5. Analysis of the best performers per sector VS gender of the CEO
6. Analysis of revenue per state
Data preparation steps are here provided according to the scope of the analysis. The following chunk aims at providing a brief summary of the data while highlighting possible errors like inappropriate data types, missing values, duplicated values, and so on.
companies %>%
summarize()
## $company
## $company$variableType
## Variable type: character
## $company$countMissing
## Number of missing obs.: 0 (0 %)
## $company$uniqueValues
## Number of unique values: 1000
## $company$centralValue
## Mode: "1-800-Flowers.com"
##
## $rank
## $rank$variableType
## Variable type: numeric
## $rank$countMissing
## Number of missing obs.: 0 (0 %)
## $rank$uniqueValues
## Number of unique values: 997
## $rank$centralValue
## Median: 500.5
## $rank$quartiles
## 1st and 3rd quartiles: 250.75; 750.25
## $rank$minMax
## Min. and max.: 1; 1000
##
## $rank_change
## $rank_change$variableType
## Variable type: numeric
## $rank_change$countMissing
## Number of missing obs.: 0 (0 %)
## $rank_change$uniqueValues
## Number of unique values: 151
## $rank_change$centralValue
## Median: 0
## $rank_change$quartiles
## 1st and 3rd quartiles: -3; 0
## $rank_change$minMax
## Min. and max.: -178; 295
##
## $revenue
## $revenue$variableType
## Variable type: numeric
## $revenue$countMissing
## Number of missing obs.: 0 (0 %)
## $revenue$uniqueValues
## Number of unique values: 993
## $revenue$centralValue
## Median: 6375.5
## $revenue$quartiles
## 1st and 3rd quartiles: 3500.75; 14615.47
## $revenue$minMax
## Min. and max.: 2107.2; 572754
##
## $profit
## $profit$variableType
## Variable type: numeric
## $profit$countMissing
## Number of missing obs.: 3 (0.3 %)
## $profit$uniqueValues
## Number of unique values: 964
## $profit$centralValue
## Median: 572
## $profit$quartiles
## 1st and 3rd quartiles: 195.4; 1498.3
## $profit$minMax
## Min. and max.: -6520; 94680
##
## $`num. of employees`
## $`num. of employees`$variableType
## Variable type: numeric
## $`num. of employees`$countMissing
## Number of missing obs.: 1 (0.1 %)
## $`num. of employees`$uniqueValues
## Number of unique values: 762
## $`num. of employees`$centralValue
## Median: 13530
## $`num. of employees`$quartiles
## 1st and 3rd quartiles: 6500; 28900
## $`num. of employees`$minMax
## Min. and max.: 160; 2300000
##
## $sector
## $sector$variableType
## Variable type: character
## $sector$countMissing
## Number of missing obs.: 0 (0 %)
## $sector$uniqueValues
## Number of unique values: 21
## $sector$centralValue
## Mode: "Financials"
##
## $city
## $city$variableType
## Variable type: character
## $city$countMissing
## Number of missing obs.: 0 (0 %)
## $city$uniqueValues
## Number of unique values: 400
## $city$centralValue
## Mode: "New York"
##
## $state
## $state$variableType
## Variable type: character
## $state$countMissing
## Number of missing obs.: 0 (0 %)
## $state$uniqueValues
## Number of unique values: 46
## $state$centralValue
## Mode: "CA"
##
## $newcomer
## $newcomer$variableType
## Variable type: character
## $newcomer$countMissing
## Number of missing obs.: 0 (0 %)
## $newcomer$uniqueValues
## Number of unique values: 2
## $newcomer$centralValue
## Mode: "no"
##
## $ceo_founder
## $ceo_founder$variableType
## Variable type: character
## $ceo_founder$countMissing
## Number of missing obs.: 0 (0 %)
## $ceo_founder$uniqueValues
## Number of unique values: 2
## $ceo_founder$centralValue
## Mode: "no"
##
## $ceo_woman
## $ceo_woman$variableType
## Variable type: character
## $ceo_woman$countMissing
## Number of missing obs.: 0 (0 %)
## $ceo_woman$uniqueValues
## Number of unique values: 2
## $ceo_woman$centralValue
## Mode: "no"
##
## $profitable
## $profitable$variableType
## Variable type: character
## $profitable$countMissing
## Number of missing obs.: 0 (0 %)
## $profitable$uniqueValues
## Number of unique values: 2
## $profitable$centralValue
## Mode: "yes"
##
## $prev_rank
## $prev_rank$variableType
## Variable type: numeric
## $prev_rank$countMissing
## Number of missing obs.: 531 (53.1 %)
## $prev_rank$uniqueValues
## Number of unique values: 468
## $prev_rank$centralValue
## Median: 237
## $prev_rank$quartiles
## 1st and 3rd quartiles: 118; 355
## $prev_rank$minMax
## Min. and max.: 1; 498
##
## $CEO
## $CEO$variableType
## Variable type: character
## $CEO$countMissing
## Number of missing obs.: 0 (0 %)
## $CEO$uniqueValues
## Number of unique values: 994
## $CEO$centralValue
## Mode: "#REF!"
##
## $Website
## $Website$variableType
## Variable type: character
## $Website$countMissing
## Number of missing obs.: 0 (0 %)
## $Website$uniqueValues
## Number of unique values: 1000
## $Website$centralValue
## Mode: "https://andersonsinc.com"
##
## $Ticker
## $Ticker$variableType
## Variable type: character
## $Ticker$countMissing
## Number of missing obs.: 49 (4.9 %)
## $Ticker$uniqueValues
## Number of unique values: 951
## $Ticker$centralValue
## Mode: "A"
##
## $`Market Cap`
## $`Market Cap`$variableType
## Variable type: character
## $`Market Cap`$countMissing
## Number of missing obs.: 31 (3.1 %)
## $`Market Cap`$uniqueValues
## Number of unique values: 959
## $`Market Cap`$centralValue
## Mode: "-"
Despite the above-not-so-friendly user output, this kind of summary is preferable for inspection purposes, since it is able to group and show all the needed information.
The first cleaning operation are the following:
companies=companies %>%
mutate(`num. of employees`=as.integer(`num. of employees`),
newcomer=as.logical(ifelse(newcomer=="no",0,1)),
ceo_founder=as.logical(
ifelse(ceo_founder=="no",0,1)),
ceo_woman=as.logical(ifelse(ceo_woman=="no",0,1)),
profitable=as.logical(ifelse(profitable=="no",0,1)),
`Market Cap`=as.numeric(companies$`Market Cap`)) %>%
select(-profitable, -prev_rank, -Website, - CEO) %>%
relocate(c("rank", "rank_change"), .before="company")
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
As visible from the warning message, NAs values were introduced. Thus, the next pivotal step is to find missing values (NAs) and to assess whether they are reasonable and acceptable or whether they need to be managed in some way.
summary(is.na(companies))
## rank rank_change company revenue
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1000 FALSE:1000 FALSE:1000 FALSE:1000
##
## profit num. of employees sector city
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:997 FALSE:999 FALSE:1000 FALSE:1000
## TRUE :3 TRUE :1
## state newcomer ceo_founder ceo_woman
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1000 FALSE:1000 FALSE:1000 FALSE:1000
##
## Ticker Market Cap
## Mode :logical Mode :logical
## FALSE:951 FALSE:961
## TRUE :49 TRUE :39
Ticker and Market Cap columns might display missing values when the company is private. Therefore, we would expect them to have the same number of NAs, but this is not the case. In particular, it seems that Ticker has 10 exceeding NAs, that need to be filled. By filtering the observations with the double condition that Ticker is NA and Market Cap is not NA, we can find the observations that we need to manage. We decided to solve the problem by searching for their tickers in the Internet and by inserting them accordingly.
companies %>%
filter(is.na(companies$Ticker) & !is.na(companies$`Market Cap`)) %>%
select(company, Ticker, `Market Cap`) %>%
datatable(options = list(scrollX = TRUE))
Tickerdf=data.frame(Ticker1=c("CERN", "CNR", "TWTR", "SAFM", "ANAT", "UFS", "MTOR", "REGI", "CHNG", "ZNGA"))
companiesDF=companies %>%
filter(is.na(companies$Ticker) & !is.na(companies$`Market Cap`)) %>%
bind_cols(Tickerdf)%>%
select(-Ticker)%>%
rename("Ticker"="Ticker1")
companies_all=companies %>%
anti_join(companiesDF, by="rank") %>%
bind_rows(companiesDF)
As shown below, now the number of NAs is the same for the two columns and column Ticker column is filled as desired.
companies_all %>%
filter(Ticker=="CERN" | Ticker=="CNR" | Ticker=="TWTR" | Ticker=="SAFM" | Ticker=="ANAT" | Ticker=="UFS" | Ticker=="MTOR" | Ticker=="REGI" | Ticker=="CHNG" | Ticker=="ZNGA") %>%
select(company, Ticker, `Market Cap`) %>%
datatable(options = list(scrollX = TRUE))
NAs are correct only when displayed for both Market Cap and Ticker, because it means the company is private. However, to make the dataset easy to read and to interpret, it might be desirable to insert “Private company” in Ticker and 0 as Market Cap for private companies.
companies_all=companies_all %>%
mutate(Ticker=ifelse(is.na(Ticker), "Private company", Ticker),
`Market Cap` =ifelse(is.na(`Market Cap`), 0, `Market Cap`))
companies_all %>%
filter(is.na(Ticker) | is.na(`Market Cap`)) %>%
datatable(options = list(scrollX = TRUE))
We continue the data cleaning by renaming the columns as desired.
colnames(companies_all)=c("Rank", "Rank_change", "Company", "Revenue", "Profit", "Number_employees", "Sector", "City", "State", "Newcomer", "CEO_founder", "CEO_woman", "Ticker", "Market_cap")
As shown in the first summary performed, missing valued are observed also for some quantitative columns that will be used afterwards to highlight relevant patterns. In this chunk we correct the problem of missing values in profit and number of employees by forcing the mean value of the variable. The statistic used is the trimmed mean, as it is not affected by possible outliers.
companies_all %>%
filter(is.na(Profit) | is.na(Number_employees)) %>%
select(Company, Profit, Number_employees) %>%
datatable(options = list(scrollX = TRUE))
companies_all=companies_all %>%
mutate(Profit=ifelse(is.na(Profit),
mean(Profit, trim=0.05 , na.rm = TRUE),
Profit),
Number_employees=as.integer(ifelse(
is.na(Number_employees),
round(mean(Number_employees, trim=0.05,
na.rm = TRUE), 0),
Number_employees)))
The summary showed 3 duplicated values for the ranking column, but it does not make sense: each company should have a unique ranking position.
companies_all %>%
filter(duplicated(Rank)) %>%
datatable(options = list(scrollX = TRUE))
companies_all=companies_all %>%
mutate(Rank=ifelse(duplicated(Rank), Rank+1, Rank))
companies_all %>%
filter(duplicated(Rank)) %>%
datatable(options = list(scrollX = TRUE))
Sectors will be a core variable in the identification of patterns in the following sections of the project. Therefore, it is important to inspect the variable and to understand how many distinct sectors are there. We see that the string “Food” appears in two sectors, that can be grouped together as a single one. This is desirable because otherwise the analysis might be misleading: if we looked for patterns by considering only one of the two sectors, we would be missing part of the observations belonging to the “Food” sector.
unique(companies_all$Sector) %>%
as_tibble()
## # A tibble: 21 × 1
## value
## <chr>
## 1 Retailing
## 2 Technology
## 3 Health Care
## 4 Energy
## 5 Financials
## 6 Telecommunications
## 7 Food & Drug Stores
## 8 Motor Vehicles & Parts
## 9 Transportation
## 10 Food, Beverages & Tobacco
## # … with 11 more rows
wrong_sp=companies_all %>%
filter(grepl("^Food", Sector)) %>%
select(Sector) %>%
unlist()
companies_all=companies_all %>%
mutate(Sector=ifelse(Sector %in% wrong_sp, "Food, Beverages, Tobacco & Drug Stores", Sector))
companies_all %>%
filter(grepl("^Food", Sector)) %>%
select(Company, Sector) %>%
datatable(options = list(scrollX = TRUE))
The EDA starts with a summary of the quantitative variables in a table to show the main statistics and with an histogram for graphical representation of the mean value.
companies_all[, c("Revenue", "Profit", "Number_employees", "Market_cap")] %>%
sapply(function(x) round(summary(x), 2)) %>%
datatable(options = list(scrollX = TRUE))
EDA_chart=companies_all %>%
select(Market_cap, Revenue, Profit, Number_employees)
EDA_chart %>%
summarise_all(mean) %>%
gather() %>%
ggplot(aes(x = key, y = value)) +
geom_col(fill = "grey") +
coord_flip() +
labs(x = "Variables", y = "Mean Value")
How are these quantitative variables distributed? Are there outliers? Box plots can provide an answer to these questions.
boxplot(EDA_chart$Revenue)
boxplot(EDA_chart$Profit)
boxplot(EDA_chart$Number_employees)
boxplot(EDA_chart$Market_cap)
All the variables are clearly affected by a small number of observations that display huge values (outliers) compared to the majority of the observations. These values might create problems when building predictive models.
It is also interesting, before discovering patterns and relationship, to have a first look at the correlation among variables.
suppressWarnings(EDA_chart %>%
chart.Correlation())
We can notice an outstanding positive correlation between Revenue - Number_employees and between Profit - Market_cap. It is worth mentioning that correlation in not causation, hence, the statistic does not provide information about the direction of the causality. It just indicates that the two variables follows the same pattern, but this might also be due to another unobserved correlated variable.
One potential pattern worth exploring is the relationship between market capitalization and profits. This insight results from the deepening of the previous EDA, in which a strong correlation between the above-mentioned variables have emerged.
In order to investigate further and allow an automation of the task for future analysis with dashboards, a user-defined function is created. The latter, starting from a data frame and two variables, it will firstly investigate whether the correlation between the variables is strong enough. Afterwards, it will generate the related plot and the regression output to allow a better investigation of the problem.
cor(companies_all$Profit, companies_all$Market_cap)
## [1] 0.8318917
Linear_relationship=function(data, x, y) {
#eligibility test
if(cor(x, y)<0.6) {
stop("Correlation between specified variables is too weak")
}
library(ggplot2)
library(PerformanceAnalytics)
library(ggpubr)
library(knitr)
library(gridExtra)
library(sjPlot)
library(sjstats)
library(DT)
library(broom)
# #first table
firsttable=data %>%
ggplot() +
aes(x = x, y = y) +
geom_point(shape = "bullet", size = 2, colour = "#112446") +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Linear relationship between x and y",
subtitle = paste("Correlation =", round(cor(x, y), 2))) +
theme_linedraw()
print(firsttable)
#second table
secondtable=data.frame(x, y)
suppressWarnings(chart.Correlation(secondtable))
#third table
thirdtable=lm(y~x, data = data)
summary(thirdtable)
}
Linear_relationship(companies_all, companies_all$Profit, companies_all$Market_cap)
## `geom_smooth()` using formula = 'y ~ x'
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -877891 -8938 -3494 5987 913738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4684.9712 2368.6001 1.978 0.0482 *
## x 16.6899 0.3524 47.357 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71430 on 998 degrees of freedom
## Multiple R-squared: 0.692, Adjusted R-squared: 0.6917
## F-statistic: 2243 on 1 and 998 DF, p-value: < 2.2e-16
In order to be sure that the defined function is correctly coded, a test is performed to see whether the output is the one expected when the variables show a correlation lower than the threshold (60%).
cor(companies_all$Profit, companies_all$Number_employees)
## [1] 0.3375949
Linear_relationship(companies_all, companies_all$Profit, companies_all$Number_employees)
## Error in Linear_relationship(companies_all, companies_all$Profit, companies_all$Number_employees): Correlation between specified variables is too weak
As stated before, it is possible to see there is a strong correlation (83%) which is statistically significant. Moreover, the linear relationship seems to be positive, even if it is clear that there are severe outliers in the data set considered. To better clarify this step, we can analyse the histogram that relates the corporate profits and the related frequency.
hist(companies_all$Profit)
The distribution is positively skewed, meaning that there are severe outliers that make the sample biased and, therefore, not suitable for performing inference.
To conclude, after all the above-mentioned discussion, it is possible to state that there is a positive relationship between the economic performance of a company (measured as profit) and the market capitalization. However, the test population is heavily influenced and, thus, biased by the presence of severe outliers which do not allow further investigate unless properly contextualized.
Again, worth of mention, is to see whether the presence of a female CEO affects the relationship.
companies_all %>%
filter(!(Sector %in% c("Food & Drug Stores", "Motor Vehicles & Parts", "Food, Beverages & Tobacco",
"Aerospace & Defense", "Hotels, Restaurants & Leisure", "Engineering & Construction"))) %>%
ggplot() +
aes(x = Profit, y = Market_cap, colour = CEO_woman) +
geom_point(shape = "bullet", size = 2, colour = "#112446") +
geom_smooth(method = "lm", se = FALSE) +
labs(
x = "Profit",
y = "Market capitalization",
title = "Linear relationship between mkt cap and profit"
) +
theme_linedraw()
## `geom_smooth()` using formula = 'y ~ x'
Unfortunately, from the above graph, it seems that a female management negatively affects the corporate market capitalization. However, this result is again biased by the presence of outliers, and, thus, not statistically able to describe a pattern in an unknown population. The final remark could be the investigation of whether and why almost all the anomalous firms are directed by a male CEO.
By looking at the correlation matrix, it seems that a positive relationship is likely to exist between revenues and number of employees in the companies of the sample. In particular, it is reasonable to think that the revenue amount in part depends on the number of employees - hence, on the size of the company. Here it follows a scatter plot of the observations with the line that best fit them.
companies_all %>%
filter(Number_employees >= 0L & Number_employees <= 1400000L) %>%
filter(!(Sector %in%
c("Food, Beverages, Tobacco & Drug Stores", "Motor Vehicles & Parts", "Aerospace & Defense", "Hotels, Restaurants & Leisure",
"Engineering & Construction"))) %>%
ggplot() +
aes(x = Number_employees, y = Revenue) +
geom_point(shape = "circle", size = 1.5, colour = "#112446") +
geom_smooth(method="lm", se=FALSE, span = 0.75) +
labs(
x = "Number of employees",
y = "Revenues",
title = "Employees/Revenues relationship",
subtitle = "Overall analysis"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
As we can see, the positive linear relationship seems to exist, even if it is clear that the observations tend to cluster on the origin of the axis and even if the cloud of points appears quite dispersed. The path of the graph might be due to outliers, or anyway to those limited number of observations that lie on the top/right section of the graph. The same considerations can be confirmed also from the following table, that shows the main statistics of the distributions of the variables under analysis.
companies_all[, c("Revenue", "Number_employees")] %>%
sapply(function(x) round(summary(x), 2)) %>%
datatable(options = list(scrollX = TRUE))
So far, the whole sample has been considered. However, it would be interesting to understand whether the amount of revenues and the number of employees are somehow related to the sector the companies belong to.
companies_all %>%
filter(Number_employees >= 0L & Number_employees <= 1400000L) %>%
filter(!(Sector %in%
c("Food, Beverages, Tobacco & Drug Stores", "Motor Vehicles & Parts", "Aerospace & Defense", "Hotels, Restaurants & Leisure",
"Engineering & Construction"))) %>%
ggplot() +
aes(x = Number_employees, y = Revenue, colour = Sector) +
geom_point(shape = "circle",
size = 2.45) +
scale_color_manual(values = c(`Aerospace & Defense` = "#F8766D", Apparel = "#E8813F",
`Business Services` = "#D88D11", Chemicals = "#C29800", Energy = "#A7A200", `Engineering & Construction` = "#83AB05",
Financials = "#45B21D", `Food, Beverages, Tobacco & Drug Stores` = "#07B935", `Health Care` = "#00BC5D",
`Hotels, Restaurants & Leisure` = "#00BF89", `Household Products` = "#00BFAD", Industrials = "#00BBC9",
Materials = "#05B7E4", Media = "#2DABF0", `Motor Vehicles & Parts` = "#569FFC", Retailing = "#878EFD",
Technology = "#BA7DFC", Telecommunications = "#E06FF2", Transportation = "#EF68DA", Wholesalers = "#FF61C3"
)) +
labs(x = "Number of employees", y = "Revenues", title = "Employees/Revenues relationship ", subtitle = "Sectorial analysis",
color = "Sector") +
theme_minimal()
By looking at this plot, it seems that, according to the sample, there is no such pattern for which companies belonging to a given sector are bigger and have higher revenues.
At this point, it could be worthy to try to build a model that is able, by capturing the pattern between the observation, to predict the revenues a company might achieve according to its number of employees.
regression <- lm(Revenue~Number_employees, data=companies_all)
summary(regression)
##
## Call:
## lm(formula = Revenue ~ Number_employees, data = companies_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92092 -7579 -5497 -894 314055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.766e+03 9.295e+02 8.355 <2e-16 ***
## Number_employees 2.857e-01 8.412e-03 33.962 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27810 on 998 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.5357
## F-statistic: 1153 on 1 and 998 DF, p-value: < 2.2e-16
The model is statistically significant (this can be commented by looking at the F-statistic and the related p-value). However, a few consideration should be done. First of all, the model tries to capture a linear relationship between the variables, that might no be the most appropriate one. Secondly, the R-squared statistics - that represents the goodnes of fit of the data - is about 0,53, meaning to say that the model is able to explain a little more than half of the variability of the dependent variable (revenues). The result is not bad per se, but it indicates that the simple linear regression model is not sufficient to make inference and other regressors might be included to provide better predictions. Finally, by looking at the first scatter plot provided, we can see that as long as we move for the bottom-left to the top-right part of the model, the obeservations become more and more dispersed. This might be a first signal of heteroskedasticity, that should be tested and solved with appropriate econometric techniques to create a reliable model.
Based on the following graph, we can observe that the sectors with the highest positive average change were the ones regarding Materials, alongside with Energy and Motor, Vehicles & Parts. This means that in average, the companies within these sectors are having a better ranking compared to previous years, which can translates to better opportunities in the market, concerning either higher demand, higher investment or higher perception that is related to market capitalization.
On the other hand, the sectors that had the highest negative average change were Telecommunications, alongside with Food, Beverages, Tobacco & Drug Stores, and Aerospace & Defense.
It is important to mention that these changes may consider not only specific financial performances from different companies, but also macro-economical and political aspects that could have beneficial or detrimental consequences for a particular sector.
library(ggplot2)
library(esquisse)
library(tidyverse)
Sector_Change = companies_all %>%
select(Sector, Rank_change, Company) %>%
group_by(Sector) %>%
summarise(avg_change = round(mean(Rank_change), 1)) %>%
arrange(desc(avg_change))
ggplot(Sector_Change) +
aes(x = Sector, y = avg_change, fill = avg_change) +
geom_col() +
geom_label(aes(label=avg_change), size=2.5, color="white")+
scale_fill_gradient() +
labs(fill="Average Change")+
coord_flip() +
theme_minimal()+
theme(axis.title.x = element_text(size = 0L))
library(data.table)
library(DT)
library(tidyverse)
question2 = companies_all %>%
select(Sector, Revenue, Profit) %>%
group_by(Sector) %>%
summarise(avg_revenue = round(mean(Revenue),0),avg_profit= round(mean(Profit),0))%>%
arrange(desc(avg_profit))
question2.1 = companies_all %>%
select(Sector, Revenue, Profit, Company) %>%
group_by(Sector) %>%
summarise(profit_margin = round((sum(Profit) / sum(Revenue)*100),2)) %>%
arrange(desc(profit_margin))
question2 %>%
datatable(options = list(scrollX = TRUE))
library(ggplot2)
ggplot(question2) +
aes(x = Sector, y = avg_revenue, fill = avg_revenue, weight = avg_revenue) +
geom_col() +
geom_label(aes(label=avg_revenue), size=2.5, color="white")+
scale_fill_gradient() +
labs(fill="Average Revenue")+
coord_flip() +
theme_minimal()+
theme(axis.title.x = element_text(size = 0L))
ggplot(question2) +
aes(x = Sector, y = avg_profit, fill = avg_profit, weight = avg_profit) +
geom_col() +
geom_label(aes(label=avg_profit), size=2.5, color="white")+
scale_fill_gradient() +
labs(fill="Average Profit")+
coord_flip() +
theme_minimal()+
theme(axis.title.x = element_text(size = 0L))
First, in this table we see that the sectors that drive more revenue in average is the sector of telecommunications. Compared to the second place, this sector has more than 40.000M USD compared to the second place (Technology). Moreover, when talking about profit averages, this sector has the first place as well. This is relevant, however we know that different sectors have different business model, hence the behavior of net income Vs. Net Profit varies depending the sector. On later graphs, we will analyze this aspect more on depth.
question2.1 %>%
datatable(options = list(scrollX = TRUE))
ggplot(question2.1) +
aes(x = Sector, y = profit_margin, fill = profit_margin) +
geom_col() +
geom_label(aes(label=profit_margin), size=2.5, color="white")+
scale_fill_gradient() +
labs(fill="Average Margin")+
coord_flip() +
theme_minimal()+
theme(axis.title.x = element_text(size = 0L))
In order to compare profits among sectors, we decided to calculate the profit margin. The ranking in this case, changes and now the first place belongs to the technology sector. Based on this information, we can conclude that depending on each sector, the costs varies. The sector which more of the revenues get to the profits are Technology, Financials, and Business Services while there are sectors that the costs consume a major part of the revenues, resulting in small profit margins. Some of these sectors are Aerospace & Defense, Retailing and Wholesalers.
In the following graphs, we are analyzing the performance and the relationship between CEO women and the different companies and sectors that we have in our data set.
When analyzing the first graph, regarding the number of CEO women in all the industrial sectors that we are considering, we can observe that the main sectors to have the higher number of CEO women are: Retailing, Financials, Health Care, Energy and Technology.
In contrast, there are several sectors (Wholesalers, Motor, vehicle & Parts, Media, Engineering & Construction, Chemicals, Apparel) where there is only one CEO woman. Furthermore, there is one sector (Telecommunications) that has no companies with a CEO woman. Ironically, this is the sector with the highest revenue and profit average.
library(tidyverse)
library(dplyr)
library(ggplot2)
companies_all$CEO_woman_binary = ifelse(companies_all$CEO_woman, 1, 0)
Sector_Gender_Grouped = companies_all %>%
group_by(Sector) %>%
summarise(Total_CEO_Women = sum(CEO_woman_binary),
Avg_Profit = round(mean(Profit), 0)) %>%
arrange(desc(Total_CEO_Women))
# Graph to visualize the amount of CEO woman in all sectors
ggplot(Sector_Gender_Grouped) +
aes(x = Sector, y = Total_CEO_Women, fill = Total_CEO_Women) +
geom_col() +
geom_label(aes(label=Total_CEO_Women), size=2.5, color="white")+
scale_fill_gradient() +
coord_flip() +
theme_minimal()
Now, if we take a closer look at the top sectors considering the profit average, most of them have a significant number of CEO Woman as shown in the following graph, except for Motor, vehicle & Parts, and Telecommunications.
top5_Sectors = Sector_Gender_Grouped %>%
top_n(5, Avg_Profit)
# Graph to visualize the amount of CEO woman in the top 5 sectors by profit average
ggplot(top5_Sectors) +
aes(x = Sector, y = Avg_Profit, fill = Total_CEO_Women) +
geom_col() +
geom_label(aes(label=Total_CEO_Women), size=3.5, color="white")+
scale_fill_gradient() +
theme_minimal()
Finally, in order to narrow our research, we are considering only the top 5 companies of those 5 main sectors, regarding profit average. When looking at the graph, we can observe that in those 25 companies, there are only 2 companies where you can find CEO women: one in the Financial sector (Citigroup) and the other one in the Motor, vehicle & Parts sector (General Motors). This represents only the 8% of this exclusive sample.
df_top5_Sectors = companies_all %>%
filter(Sector %in% top5_Sectors$Sector)
df_top5_Sectors_Grouped = df_top5_Sectors %>%
group_by(Sector, Company, CEO_woman_binary) %>%
summarise(Avg_Profit = mean(Profit, na.rm = TRUE)) %>%
arrange(Sector, desc(Avg_Profit))
## `summarise()` has grouped output by 'Sector', 'Company'. You can override using
## the `.groups` argument.
df_top5_Companies_grouped = df_top5_Sectors_Grouped %>%
group_by(Sector) %>%
top_n(5,Avg_Profit)
df_top5_Companies_grouped %>%
ggplot() +
aes(
x = Company,
fill = CEO_woman_binary,
colour = Sector,
weight = Avg_Profit
) +
geom_bar() +
scale_x_discrete(labels = c("Comp. 1", "Comp. 2", "Comp. 3", "Comp. 4",
"Comp. 5", "Comp. 6", "Comp. 7", "Comp. 8",
"Comp. 9", "Comp. 10", "Comp. 11", "Comp. 12",
"Comp. 13", "Comp. 14", "Comp. 15", "Comp. 16",
"Comp. 17", "Comp. 18", "Comp. 19", "Comp. 20",
"Comp. 21", "Comp. 22", "Comp. 23", "Comp. 24",
"Comp. 25"))+
scale_fill_gradient() +
scale_color_hue(direction = 1) +
theme_dark()
When looking at the following geographical map per state, we can realize that the states where the most revenue is being collected are California, Texas and New York. Moreover, when looking at the general distribution of revenue, even though California and Texas are located more into the west side of the map, most of the revenue collected is being retained in the east side (considering also New York) .
This is important, taking into account not only general revenue distribution in the US, which can give us an idea of market opportunities in those areas, but which can also help us visualizing possible states development, due to the taxes that are being collected in those places.
library(ggplot2)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(readxl)
question6 = companies_all %>%
select(Revenue, State) %>%
group_by(State) %>%
summarise(revenue = sum(Revenue)) %>%
arrange(desc(revenue))
states = map_data("state")
coordinates = read.csv("~/Desktop/MSc in Business Analytics/Programming for Business Analytics/Neoma_group 2/Hoja de cálculo sin título - Hoja 1.csv")
coordinates_full =inner_join(coordinates, question6, by = c("State" = "State"))
States_Names = read_xlsx("~/Desktop/MSc in Business Analytics/Programming for Business Analytics/Neoma_group 2/States_Name.xlsx")
coordinates_full2 = inner_join(coordinates_full, States_Names, by = c("State" = "State"))
colnames(coordinates_full2)[5] <- "region"
Revenue_Geo = merge(coordinates_full2, states, sort = FALSE, by = "region")
Revenue_Geo = Revenue_Geo[order(Revenue_Geo$order), ]
ggplot(Revenue_Geo, aes(long, lat))+
geom_polygon(aes(group = group, fill = revenue))+
coord_map()
In conclusion, the above project analyzed various aspects of the financial performance of different industrial sectors and their relationship with CEO gender and geographical location among other analysis between Categorical variables VS Cardinal variables, and Cardinal variables Vs Cardinal Variables. The results showed that the sectors with the highest positive average change were Materials, Energy, and Motor, Vehicles & Parts, meaning that in the sample those sector escalated in the revenue raking more than the other, from the previous year, while the sectors with the highest negative average change were Telecommunications, Food, Beverages, Tobacco & Drug Stores, and Aerospace & Defense, marking the trends about sectors that gain relevance on certain historical context. Furthermore, the sector with the highest revenue and profit average was Telecommunications, while the sector with the highest profit margin was Technology.
Regarding the relationship between CEO gender and financial performance, the results showed that the sectors with a higher number of CEO women were Retailing, Financials, Health Care, Energy and Technology. However, when considering the top 5 companies of the 5 main sectors with the highest profit average, there were only 2 companies with CEO women, representing only 8% of the sample.
Finally, the analysis of revenue per state showed that the states with the highest revenue were California (Silicon Valley is a strong driver), Texas, and New York, however, most of the revenue being allocated in the east side of the country.
Overall, this project provides valuable insights into the financial performance of different industrial sectors and the factors that may impact their success. By analyzing the relationship between CEO gender and financial performance, as well as geographical location and revenue, it can help companies and investors make informed decisions about the market opportunities and trends in different industries.