Company Revenue Growth Analysis & Prediction

Introduction

The dataset contains information about 600 e-commerce companies, including their revenue in 2022, employee count, past revenue growth, industry category, and location category.

The objectives of this project are:

To explore the relationship between a company’s employee count and revenue in 2022
To build a predictive model that can assess the likelihood of a company achieving positive revenue growth in the year 2023.

Data Obtain

The dataset is obtained from the source of https://www.statista.com/companies/search?queries%5B%5D=ecommerce.

The basic information such as column names, total row numbers, total column numbers, data types and etc. are identified as followed by steps belows:

# Set CRAN mirror
setwd("D:/MDS project/WQD7004 PROGRAMMING FOR DATA SCIENCE")
df <- read.csv("Revenue Prediction Dataset.csv")

#check data information
head(df,5)

##                                  Name       Location      City           State
## 1           Ajso Global Ecommerce LLC  United States Anchorage          Alaska
## 2             Lcc Best Ecommerce Ltd. United Kingdom   Cardiff South Glamorgan
## 3              Astraza Ecommerce Ltd. United Kingdom     Hayes       Middlesex
## 4 Shanghai Kaijie Ecommerce Co., Ltd.          China  Shanghai        Shanghai
## 5       Grays Ecommerce Group Limited      Australia Moorebank New South Wales
##                                           Industry Revenue.in.2022
## 1                                          Tobacco     $450.9m USD
## 2 Electronic & Telecommunication Equipment & Parts     $418.3m USD
## 3 Electronic & Telecommunication Equipment & Parts     $404.3m USD
## 4              Retail Trade, Except Motor Vehicles     $374.2m USD
## 5                                       E-Commerce     $162.8m USD
##   Last.Revenue.Growth ISO.Code Employees IPO.Status Operating.Status
## 1               0.05%      USD       164    Private           Active
## 2              -2.72%      USD         6    Private           Active
## 3               0.03%      USD         6    Private           Active
## 4               0.41%      USD       825    Private           Active
## 5            1966.76%      USD       652    Private           Active
##   Founding.Year...Est.
## 1                  n/a
## 2                 2018
## 3                 2018
## 4                 2010
## 5                 1992

print(paste("Number of records: ", nrow(df)))

## [1] "Number of records:  600"

print(paste("Number of features: ", ncol(df)))

## [1] "Number of features:  12"

colnames(df)

##  [1] "Name"                 "Location"             "City"                
##  [4] "State"                "Industry"             "Revenue.in.2022"     
##  [7] "Last.Revenue.Growth"  "ISO.Code"             "Employees"           
## [10] "IPO.Status"           "Operating.Status"     "Founding.Year...Est."

summary(df)

##      Name             Location             City              State          
##  Length:600         Length:600         Length:600         Length:600        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    Industry         Revenue.in.2022    Last.Revenue.Growth   ISO.Code        
##  Length:600         Length:600         Length:600          Length:600        
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##   Employees          IPO.Status        Operating.Status   Founding.Year...Est.
##  Length:600         Length:600         Length:600         Length:600          
##  Class :character   Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

str(df)

## 'data.frame':    600 obs. of  12 variables:
##  $ Name                : chr  "Ajso Global Ecommerce LLC" "Lcc Best Ecommerce Ltd." "Astraza Ecommerce Ltd." "Shanghai Kaijie Ecommerce Co., Ltd." ...
##  $ Location            : chr  "United States" "United Kingdom" "United Kingdom" "China" ...
##  $ City                : chr  "Anchorage" "Cardiff" "Hayes" "Shanghai" ...
##  $ State               : chr  "Alaska" "South Glamorgan" "Middlesex" "Shanghai" ...
##  $ Industry            : chr  "Tobacco" "Electronic & Telecommunication Equipment & Parts" "Electronic & Telecommunication Equipment & Parts" "Retail Trade, Except Motor Vehicles" ...
##  $ Revenue.in.2022     : chr  "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
##  $ Last.Revenue.Growth : chr  "0.05%" "-2.72%" "0.03%" "0.41%" ...
##  $ ISO.Code            : chr  "USD" "USD" "USD" "USD" ...
##  $ Employees           : chr  "164" "6" "6" "825" ...
##  $ IPO.Status          : chr  "Private" "Private" "Private" "Private" ...
##  $ Operating.Status    : chr  "Active" "Active" "Active" "Active" ...
##  $ Founding.Year...Est.: chr  "n/a" "2018" "2018" "2010" ...

Data Cleaning

Identify duplicate rows and missing data
Remove the unnecessary column
Convert the missing data are converted from “n/a” to standard representation of missing value in R programming which is “NA”
Separate the categorical data into few parent categories in order to give clearer picture and provide informative insights

#Remove unnecessary column
df1 <- df[,c("Location", "Industry", "Revenue.in.2022","Employees","Last.Revenue.Growth","Founding.Year...Est.")]
head(df1,5)

##         Location                                         Industry
## 1  United States                                          Tobacco
## 2 United Kingdom Electronic & Telecommunication Equipment & Parts
## 3 United Kingdom Electronic & Telecommunication Equipment & Parts
## 4          China              Retail Trade, Except Motor Vehicles
## 5      Australia                                       E-Commerce
##   Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## 1     $450.9m USD       164               0.05%                  n/a
## 2     $418.3m USD         6              -2.72%                 2018
## 3     $404.3m USD         6               0.03%                 2018
## 4     $374.2m USD       825               0.41%                 2010
## 5     $162.8m USD       652            1966.76%                 1992

df1[df1 == "n/a"] <- NA       #change n/a to NA (standard representation of missing value)
colSums(is.na(df1))

##             Location             Industry      Revenue.in.2022 
##                    0                   51                    0 
##            Employees  Last.Revenue.Growth Founding.Year...Est. 
##                    1                    1                  256

#Cleaning for categorical data
#for industry column

df1$Industry.Category <- "Other"  # Default

# Consolidate based on keywords
industry_hierarchy <- list(
  "IT & Computer Services" = c("Computer Programming", "Computer Programming, Consultancy", "Other IT & Computer Services"),
  "Telecommunications" = c("Telecommunications", "Other Telecommunications"),
  "Finance & Insurance" = c("Finance-Related Services, Except Insurance & Pension Funding", "Finance & Insurance", "Finance & Insurance-Related Activities", "Reinsurance", "Trusts, Funds & Similar Financial Entities","Insurance & Pension-Related Services"),
  "Retail Trade" = c("Wholesale Trade, Except Motor Vehicles","Dairy Products","Non-Specialized Wholesale Trade","Retail Trade, Except Motor Vehicles", "Retail Sales via Stalls & Markets", "Other Non-Specialized Stores", "Other Specialized Wholesale Activities","Wholesale & Retail Trade, Including Motor Vehicle Repairs" ),
  "Electrical components" = c("Electronic Components & Boards","Electronic & Telecommunication Equipment & Parts"),
  "E-commerce" = "E-Commerce",
  "Media" = c("Information & Communication","Clothing, Footwear & Leather Articles","Electrical Household Appliances, Furniture, Lighting Equipment & Other Articles","Advertising","Radio Broadcasting","Books, Newspapers & Stationery"),
  "Textiles" = c("Textiles, Clothing & Footwear","Textiles","Other Textiles"),
  "Construction" = c("Real Estate","Construction") ,
  "F&B" = c("Food, Beverages & Tobacco Wholesales","Food","Soft Drinks & Water","Malt Liquors & Malt","Other Food Services","Tobacco"),
  "Transportation & Storage" = c("Postal Services","Transportation & Storage", "Other Transportation Support Activities", "Warehousing & Storage","Courier Services"),
  "Administration" = c("Computer Consultancy & Facilities Management","Administrative & Supportive Services","Management Consultancy"),
  "Professional" = c("Electrical Household Appliances, Furniture, Lighting Equipment & Other Articles","Travel Agencies","Professional, Scientific & Technical Services","Education","Research & Experimental Development on Natural Sciences & Engineering","Pharmaceutical or Medical Goods, Cosmetic & Toilet Articles") ,
  "Other" = "Other"  # Default category for unmatched values
)

# Loop through hierarchy and apply consolidation
for (parent_category in names(industry_hierarchy)) {
  subcategories <- industry_hierarchy[[parent_category]]
  df1$Industry.Category[grepl(paste(subcategories, collapse = "|"), df1$Industry, ignore.case = TRUE)] <- parent_category
}
industry_counts <- table(df1$Industry.Category)
industry_counts

## 
##           Administration             Construction               E-commerce 
##                       59                        7                       64 
##    Electrical components                      F&B      Finance & Insurance 
##                        3                       11                       13 
##   IT & Computer Services                    Media                    Other 
##                       70                       16                      164 
##             Professional             Retail Trade       Telecommunications 
##                        7                      154                       16 
##                 Textiles Transportation & Storage 
##                        6                       10

#For location column
table(df1$Location)

## 
##            Australia              Austria              Bahrain 
##                    5                    1                    1 
##              Belgium               Brazil               Canada 
##                    1                    1                    7 
##                China              Denmark                Egypt 
##                   98                    1                    1 
##              Finland               France              Germany 
##                    2                    1                   80 
##  Hong Kong SAR China                India            Indonesia 
##                    7                  228                    4 
##              Ireland               Israel                Italy 
##                    2                    1                    8 
##               Jordan              Lebanon                Libya 
##                    3                    1                    1 
##             Malaysia            Mauritius              Morocco 
##                    8                    1                    2 
##          Netherlands          New Zealand              Nigeria 
##                    2                    2                    1 
##               Norway          Philippines               Poland 
##                    1                    1                    2 
##                Qatar               Russia         Saudi Arabia 
##                    2                    1                    2 
##            Singapore         South Africa          South Korea 
##                    3                    4                    1 
##                Spain               Sweden               Taiwan 
##                    5                    1                    1 
##             Tanzania             Thailand United Arab Emirates 
##                    1                    4                    3 
##       United Kingdom        United States           Uzbekistan 
##                   43                   46                    1 
##              Vietnam 
##                    8

df1$location.Category <- "Other"  # Default

location_hierarchy <- list(
  "Australia & New Zealand" = c("Australia", "New Zealand"),
  "Europe" = c("Austria", "Belgium", "Denmark", "Finland", "France", "Germany", "Ireland","United Kingdom", 
               "Italy", "Netherlands", "Norway", "Poland", "Spain", "Sweden","Russia") ,
  "Middle East" = c("Bahrain", "Jordan", "Lebanon", "Qatar", "Saudi Arabia", "United Arab Emirates"),
  "North & South America" = c("Brazil", "Canada", "United States"),
  "Asia-Pacific" =c("China", "Hong Kong SAR China", "India", "Indonesia", "Israel", "Japan", 
              "Malaysia", "Philippines", "Singapore", "South Korea", "Taiwan", "Thailand", 
              "Vietnam","Uzbekistan"),
  "Africa" = c("Egypt", "Libya", "Mauritius", "Morocco", "Nigeria", "South Africa", "Tanzania"),
  "Other" = "Other"
)

# Loop through hierarchy and apply consolidation
for (country in names(location_hierarchy)) {
  subcategories <- location_hierarchy[[country]]
  df1$Location.Category[grepl(paste(subcategories, collapse = "|"), df1$Location, ignore.case = TRUE)] <- country
}
location_counts <- table(df1$Location.Category)
location_counts

## 
##                  Africa            Asia-Pacific Australia & New Zealand 
##                      11                     365                       7 
##                  Europe             Middle East   North & South America 
##                     151                      12                      54

After cleaning the categorical data, the numeric data will be cleaned by checking the impact of the variable itself using correlation matrix or based on the percentage of missing value as compared to overall dataset.

#Select only the numeric column for cleaning
Maindf <- df1[,c("Revenue.in.2022","Employees","Last.Revenue.Growth","Founding.Year...Est.")]

#Re-check the data information
print(paste("Number of records: ", nrow(Maindf)))

## [1] "Number of records:  600"

print(paste("Number of features: ", ncol(Maindf)))

## [1] "Number of features:  4"

colnames(Maindf)

## [1] "Revenue.in.2022"      "Employees"            "Last.Revenue.Growth" 
## [4] "Founding.Year...Est."

summary(Maindf)

##  Revenue.in.2022     Employees         Last.Revenue.Growth Founding.Year...Est.
##  Length:600         Length:600         Length:600          Length:600          
##  Class :character   Class :character   Class :character    Class :character    
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character

str(Maindf)

## 'data.frame':    600 obs. of  4 variables:
##  $ Revenue.in.2022     : chr  "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
##  $ Employees           : chr  "164" "6" "6" "825" ...
##  $ Last.Revenue.Growth : chr  "0.05%" "-2.72%" "0.03%" "0.41%" ...
##  $ Founding.Year...Est.: chr  NA "2018" "2018" "2010" ...

# Count the number of duplicate rows
num_duplicates <- sum(duplicated(df))
print(num_duplicates)

## [1] 0

#check missing data
any_missing <- any(Maindf == "n/a", na.rm = TRUE)
Maindf[Maindf == "n/a"] <- NA       #change n/a to NA (standard representation of missing value)
colSums(is.na(Maindf))              #count number of NA per column

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                    0                    1                    1 
## Founding.Year...Est. 
##                  256

colSums(!is.na(Maindf))             #check number of filled values per column

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                  600                  599                  599 
## Founding.Year...Est. 
##                  344

is.null(Maindf)                    #check nothingness

## [1] FALSE

Since there are only 1 rows of Employees and 1 rows of Last.Revenue.Growth are having missing values, which is less than 10% of the overall dataset, both rows will be removed.

#remove missing column of employees and revenue
clean_Employees <- Maindf[complete.cases(Maindf$Employees), ]
print(sum(is.na(clean_Employees$Employees)))

## [1] 0

print(sum(!is.na(clean_Employees$Employees)))

## [1] 599

clean_Revenue <- clean_Employees[complete.cases(clean_Employees$Last.Revenue.Growth), ]
clean_Maindf<- clean_Revenue
print(sum(is.na(clean_Maindf$Last.Revenue.Growth)))

## [1] 0

print(sum(!is.na(clean_Maindf$Last.Revenue.Growth)))

## [1] 598

summary(clean_Maindf)

##  Revenue.in.2022     Employees         Last.Revenue.Growth Founding.Year...Est.
##  Length:598         Length:598         Length:598          Length:598          
##  Class :character   Class :character   Class :character    Class :character    
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character

colSums(is.na(clean_Maindf))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                    0                    0                    0 
## Founding.Year...Est. 
##                  256

colSums(!is.na(clean_Maindf))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                  598                  598                  598 
## Founding.Year...Est. 
##                  342

There are 256/600 missing values for Founding.Year…Est. Hence, the data frame is converted from character to numeric in order to perform the impact assessment using correlation matrix. Meanwhile, all “NA” rows under Founding.Year…Est. will be temporary eliminated during correlation matrix calculation.

#Before calculation, need to convert all from character to numeric
unique(clean_Maindf$Employees)  #to check why unable to direct convert

##   [1] "164"   "6"     "825"   "652"   "295"   "296"   "290"   "376"   "292"  
##  [10] "291"   "293"   "299"   "289"   "294"   "391"   "298"   "297"   "22"   
##  [19] "196"   "135"   "369"   "467"   "46"    "891"   "261"   "924"   "16"   
##  [28] "97"    "112"   "94"    "313"   "184"   "110"   "98"    "27"    "56"   
##  [37] "14"    "29"    "99"    "510"   "60"    "109"   "23"    "89"    "26"   
##  [46] "36"    "1"     "964"   "146"   "88"    "63"    "31"    "21"    "191"  
##  [55] "337"   "301"   "20"    "19"    "13"    "12"    "310"   "216"   "58"   
##  [64] "148"   "86"    "70"    "5"     "50"    "49"    "134"   "118"   "121"  
##  [73] "136"   "87"    "199"   "214"   "41"    "47"    "55"    "40"    "193"  
##  [82] "44"    "24"    "11"    "236"   "17"    "157"   "3"     "96"    "176"  
##  [91] "38"    "188"   "79"    "69"    "66"    "67"    "42"    "30"    "25"   
## [100] "65"    "150"   "18"    "151"   "2"     "77"    "53"    "155"   "4"    
## [109] "28"    "10"    "9"     "271"   "75"    "37"    "34"    "33"    "7"    
## [118] "35"    "113"   "102"   "82"    "32"    "81"    "115"   "103"   "78"   
## [127] "51"    "95"    "83"    "54"    "325"   "15"    "39"    "93"    "92"   
## [136] "541"   "48"    "8"     "43"    "64"    "1,311" "59"    "57"    "71"

sapply(clean_Maindf, class)

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##          "character"          "character"          "character" 
## Founding.Year...Est. 
##          "character"

str(clean_Maindf)

## 'data.frame':    598 obs. of  4 variables:
##  $ Revenue.in.2022     : chr  "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
##  $ Employees           : chr  "164" "6" "6" "825" ...
##  $ Last.Revenue.Growth : chr  "0.05%" "-2.72%" "0.03%" "0.41%" ...
##  $ Founding.Year...Est.: chr  NA "2018" "2018" "2010" ...

# Revenue in 2022: Remove "$" and "m USD" and convert to numeric
clean_Maindf$Revenue.in.2022 <- as.numeric(substr(clean_Maindf$Revenue.in.2022, 2, nchar(clean_Maindf$Revenue.in.2022) - 6))
# Last Revenue Growth: Remove "%"
clean_Maindf$Last.Revenue.Growth <- as.numeric(substr(clean_Maindf$Last.Revenue.Growth, 1, nchar(clean_Maindf$Last.Revenue.Growth) - 1))
# Other: normal convert
clean_Maindf$Founding.Year...Est. <- as.numeric(clean_Maindf$Founding.Year...Est.)
clean_Maindf$Employees <- as.numeric(gsub(",", "", clean_Maindf$Employees))
clean_Maindf$Employees <- as.numeric(clean_Maindf$Employees)
colSums(is.na(clean_Maindf))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                    0                    0                    0 
## Founding.Year...Est. 
##                  256

colSums(!is.na(clean_Maindf))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                  598                  598                  598 
## Founding.Year...Est. 
##                  342

#For Funding Years, to check whether the variable is impact on revenue, correlation matrix is calculated after eliminate all NA column.
#Clean NA rows for Founding year
clean_Founding <- clean_Maindf[complete.cases(clean_Maindf$Founding.Year...Est.), ]
print(sum(is.na(clean_Founding$Founding.Year...Est.)))

## [1] 0

print(sum(!is.na(clean_Founding$Founding.Year...Est.)))

## [1] 342

colSums(is.na(clean_Founding))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                    0                    0                    0 
## Founding.Year...Est. 
##                    0

colSums(!is.na(clean_Founding))

##      Revenue.in.2022            Employees  Last.Revenue.Growth 
##                  342                  342                  342 
## Founding.Year...Est. 
##                  342

head(clean_Founding,5)

##   Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## 2             418         6               -2.72                 2018
## 3             404         6                0.03                 2018
## 4             374       825                0.41                 2010
## 5             162       652             1966.76                 1992
## 6             150       295              -10.61                 2019

# Correlation Matrix Calculation
cor_matrix <- cor(clean_Founding)
print(cor_matrix)

##                      Revenue.in.2022   Employees Last.Revenue.Growth
## Revenue.in.2022            1.0000000  0.41195643           0.1636159
## Employees                  0.4119564  1.00000000           0.1997368
## Last.Revenue.Growth        0.1636159  0.19973677           1.0000000
## Founding.Year...Est.       0.0363452 -0.07226679          -0.1898559
##                      Founding.Year...Est.
## Revenue.in.2022                0.03634520
## Employees                     -0.07226679
## Last.Revenue.Growth           -0.18985586
## Founding.Year...Est.           1.00000000

Based on Correlation Matrix, Founding year have no correlation with any of the other value. Hence, by considering it consists of high number of missing value, the entire column is decided to be removed and new data frame is created.

#New data frame after remove unnecessary columns.
Final_df <- clean_Maindf[,c("Revenue.in.2022","Employees","Last.Revenue.Growth")]
head(Final_df,5)

##   Revenue.in.2022 Employees Last.Revenue.Growth
## 1             450       164                0.05
## 2             418         6               -2.72
## 3             404         6                0.03
## 4             374       825                0.41
## 5             162       652             1966.76

sapply(Final_df, class)

##     Revenue.in.2022           Employees Last.Revenue.Growth 
##           "numeric"           "numeric"           "numeric"

summary(Final_df)

##  Revenue.in.2022    Employees       Last.Revenue.Growth
##  Min.   :  2.00   Min.   :   1.00   Min.   : -50.280   
##  1st Qu.:  3.00   1st Qu.:  17.00   1st Qu.:  -4.407   
##  Median :  4.00   Median :  26.00   Median :   0.785   
##  Mean   : 15.38   Mean   :  64.91   Mean   :   4.736   
##  3rd Qu.: 11.00   3rd Qu.:  58.00   3rd Qu.:   6.570   
##  Max.   :450.00   Max.   :1311.00   Max.   :1966.760

colSums(is.na(Final_df))

##     Revenue.in.2022           Employees Last.Revenue.Growth 
##                   0                   0                   0

colSums(!is.na(Final_df))

##     Revenue.in.2022           Employees Last.Revenue.Growth 
##                 598                 598                 598

Effectiveness of Data Cleaning will be check using Histogram for skewness and BOX-plot for outlier. For skewness, due to both columns of Employees and Revenue.in.2022 shows highly positive skewness towards right side, the log-transformation method have been used to transform the data towards more symmetric result. By comparing the result of original data frame and log-transformed data frame,the similar results shows least impact of log-transform to correlation matrix.Hence, the transformation is acceptable and will not impact on prediction results.

# Check data types
class(Final_df$Employees)

## [1] "numeric"

# Convert to numeric
#your_data$variable <- as.numeric(your_data$variable)

# Install and load the moments package if not already installed

# Raw data skewness
skew_bf_Emp <- skewness(Final_df$Employees)
skew_bf_Rev <- skewness(Final_df$Revenue.in.2022)
print(paste("Original Skewness of Employees:",skew_bf_Emp))

## [1] "Original Skewness of Employees: 5.22762327599016"

print(paste("Original Skewness of Revenue in 2022:",skew_bf_Rev))

## [1] "Original Skewness of Revenue in 2022: 6.77879926639424"

# Log transformation
Final_df$Employees_log <- log(Final_df$Employees)
Final_df$Revenue.in.2022_log <- log(Final_df$Revenue.in.2022)

# Check the skewness after transformation
skew_af_Emp <- skewness(Final_df$Employees_log)
skew_af_Rev <- skewness(Final_df$Revenue.in.2022_log)
print(paste("Log-Transformed Skewness of Employees:",skew_af_Emp))

## [1] "Log-Transformed Skewness of Employees: 0.338800603770846"

print(paste("Log-Transformed Skewness of Revenue in 2022:",skew_af_Rev))

## [1] "Log-Transformed Skewness of Revenue in 2022: 1.45478092705567"

# Create histograms for both the original and log-transformed variables
par(mfrow=c(1,2))  # Set up a 1x2 grid for side-by-side plots

# Original variable
hist(Final_df$Revenue.in.2022, 
     main = "Histogram of Original",
     xlab = "Revenue.in.2022",
     col = "lightblue",
     border = "black")

# Transformed variable
hist(log(Final_df$Revenue.in.2022), 
     main = "Histogram of Log-Transformed",
     xlab = "log(Revenue.in.2022)",
     col = "lightgreen",
     border = "black")

# Original variable
hist(Final_df$Employees, 
     main = "Histogram of Original",
     xlab = "Employees",
     col = "lightblue",
     border = "black")

# Transformed variable
hist(log(Final_df$Employees), 
     main = "Histogram of Log-Transformed",
     xlab = "log(Employees)",
     col = "lightgreen",
     border = "black")

# Correlation Matrix Calculation
cor_matrix <- cor(Final_df)
print(cor_matrix)

##                     Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## Revenue.in.2022           1.0000000 0.4200475           0.1490019     0.3410082
## Employees                 0.4200475 1.0000000           0.2009087     0.7411684
## Last.Revenue.Growth       0.1490019 0.2009087           1.0000000     0.1095717
## Employees_log             0.3410082 0.7411684           0.1095717     1.0000000
## Revenue.in.2022_log       0.7358805 0.5179241           0.1267887     0.5262732
##                     Revenue.in.2022_log
## Revenue.in.2022               0.7358805
## Employees                     0.5179241
## Last.Revenue.Growth           0.1267887
## Employees_log                 0.5262732
## Revenue.in.2022_log           1.0000000

#The results shows least impact of log-transform to correlation matrix.The transformation is acceptable.

Next, outlier have been verified using BOX-plot for log-values. By using the range of 3.0 (range: 1.5-3.0 for normal to moderately skeness data), the results shows there are outliers existing in the data frame for column Last.Revenue.Growth. Hence, the outliers are removed and new BOX-plot after outlier removal is plotted.

head (Final_df,5)

##   Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## 1             450       164                0.05      5.099866
## 2             418         6               -2.72      1.791759
## 3             404         6                0.03      1.791759
## 4             374       825                0.41      6.715383
## 5             162       652             1966.76      6.480045
##   Revenue.in.2022_log
## 1            6.109248
## 2            6.035481
## 3            6.001415
## 4            5.924256
## 5            5.087596

#BOX plot for checking outliers
par(mar = c(5, 8, 4, 2))
Final_df_log1 <- Final_df[, c("Revenue.in.2022_log", "Employees_log", "Last.Revenue.Growth")]
boxplot(Final_df_log1, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot", range = 3.0)

#Remove outliers
outliers_col <- boxplot.stats(Final_df_log1$Last.Revenue.Growth)$out
outliers <- unique(c(outliers_col))
outliers

## [1] 1966.76  213.08  -50.28

Final_df_log <- Final_df_log1[!Final_df_log1$Last.Revenue.Growth %in% outliers, ]

#BOX plot after remove outliers
boxplot(Final_df_log, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot (Outliers Removed)", range = 3.0)

head(Final_df,5)

##   Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## 1             450       164                0.05      5.099866
## 2             418         6               -2.72      1.791759
## 3             404         6                0.03      1.791759
## 4             374       825                0.41      6.715383
## 5             162       652             1966.76      6.480045
##   Revenue.in.2022_log
## 1            6.109248
## 2            6.035481
## 3            6.001415
## 4            5.924256
## 5            5.087596

head(Final_df_log,5)

##   Revenue.in.2022_log Employees_log Last.Revenue.Growth
## 1            6.109248      5.099866                0.05
## 2            6.035481      1.791759               -2.72
## 3            6.001415      1.791759                0.03
## 4            5.924256      6.715383                0.41
## 6            5.010635      5.686975              -10.61

After cleaning all the numeric data, the dataset is combine again with the cleaned categorical data.

Final_df_log <- cbind(Final_df_log, Industry.Category = df1$Industry.Category[match(rownames(Final_df_log), rownames(df1))])
head(Final_df_log,5)

##   Revenue.in.2022_log Employees_log Last.Revenue.Growth     Industry.Category
## 1            6.109248      5.099866                0.05                   F&B
## 2            6.035481      1.791759               -2.72 Electrical components
## 3            6.001415      1.791759                0.03 Electrical components
## 4            5.924256      6.715383                0.41          Retail Trade
## 6            5.010635      5.686975              -10.61                 Other

Final_df_log <- cbind(Final_df_log, Location.Category = df1$Location.Category[match(rownames(Final_df_log), rownames(df1))])
head(Final_df_log,5)

##   Revenue.in.2022_log Employees_log Last.Revenue.Growth     Industry.Category
## 1            6.109248      5.099866                0.05                   F&B
## 2            6.035481      1.791759               -2.72 Electrical components
## 3            6.001415      1.791759                0.03 Electrical components
## 4            5.924256      6.715383                0.41          Retail Trade
## 6            5.010635      5.686975              -10.61                 Other
##       Location.Category
## 1 North & South America
## 2                Europe
## 3                Europe
## 4          Asia-Pacific
## 6                Europe

The data frame : Final_df_log, will be used for next process onwards.

Exploratory Data Analytics (EDA)

Understanding Dataset

summary statistic to get an overview of the central tendencies.

# Check column names
colnames(Final_df_log)

## [1] "Revenue.in.2022_log" "Employees_log"       "Last.Revenue.Growth"
## [4] "Industry.Category"   "Location.Category"

# summarization of all columns
summary(Final_df_log)

##  Revenue.in.2022_log Employees_log   Last.Revenue.Growth Industry.Category 
##  Min.   :0.6931      Min.   :0.000   Min.   :-16.440     Length:595        
##  1st Qu.:1.0986      1st Qu.:2.833   1st Qu.: -4.405     Class :character  
##  Median :1.3863      Median :3.258   Median :  0.780     Mode  :character  
##  Mean   :1.7724      Mean   :3.448   Mean   :  1.181                       
##  3rd Qu.:2.3979      3rd Qu.:4.052   3rd Qu.:  6.545                       
##  Max.   :6.1092      Max.   :7.179   Max.   : 21.880                       
##  Location.Category 
##  Length:595        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Industry and Location data are attributed as data with values within categorical variables. Meanwhile, Revenue.in.2022_log, Employees_log, Last.Revenue.Growth are attributed as data with values within numeric variables. Based on this information, it is evident that all the data can be categorized as structured data, as both the categorical and numerical data belong to this classification.

Understanding Numeric Variables

Box Plot

#download necessary libraries
# Ensure only numeric columns
numeric_cols <- Final_df_log[, sapply(Final_df_log, is.numeric)]

#BOX plot
boxplot(numeric_cols, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot", range = 3.0)

2. Histogram

par(mfrow = c(1, 3))

# Create histogram for Revenue.in.2022_log 
hist(Final_df_log$Revenue.in.2022_log, 
     main = "Histogram of Revenue.in.2022",
     xlab = "Revenue",
     col = "skyblue") 

# Create histogram for Employees_log 
hist(Final_df_log$Employees_log,
     main = "Histogram of Employees",
     xlab = "Employees",
     col = "lightgreen")  

# Create histogram for Employees_log with a specific color
hist(Final_df_log$Last.Revenue.Growth,
     main = "Histogram of Revenue Growth",
     xlab = "Revenue Growth",
     col = "pink")

Revenue.in.2022 has less variance compared to Last.Revenue.Growth, however the central tendency Revenue.in.2022 is more skewed compared to Last.Revenue.Growth.

Understanding Categorical Variables

#download necessary libraries
# Create a table for counts of Industry.Category
industry_counts <- table(Final_df_log$Industry.Category)

# Create a table for counts of Location.Category
location_counts <- table(Final_df_log$Location.Category)

# Create data frames for ggplot
industry_data <- data.frame(Var1 = names(industry_counts), Freq = as.vector(industry_counts))
location_data <- data.frame(Var1 = names(location_counts), Freq = as.vector(location_counts))

# Plotting individual bar charts for Industry and Location categories using ggplot2
p1 <- ggplot(industry_data, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
  labs(title = "Frequency of Industry Categories", x = "Industry Category", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

p2 <- ggplot(location_data, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
  labs(title = "Frequency of Location Categories", x = "Location Category", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Displaying graphs side by side using grid.arrange()
grid.arrange(p1, p2, ncol = 2)

The dataset consist of multiple industries mainly from Asia-Pacific.

Analyzing Data

To identify the dependent variable as Revenue.in.2022_log or Last.Revenue.Growth by the end of the process.
To break down the process to Numerical vs. Numerical and Categorical vs. Numerical Correlation sincethe data is structured data.

Numerical vs. Numerical Correlation

Scatter Plot

# Create a layout with two plots side by side
par(mfrow = c(1, 2))

# Scatter plot for Revenue in 2022 vs. Employees with trendline
plot(Final_df_log$Revenue.in.2022_log, Final_df_log$Employees_log,
     xlab = "Revenue in 2022", ylab = "Employees",
     main = "Employees vs. Revenue in 2022")

# Fit a linear regression line and add it to the plot
fit <- lm(Employees_log ~ Revenue.in.2022_log, data = Final_df_log)
abline(fit, col = "red")

# Scatter plot for Employees vs. Last Revenue Growth with trendline
plot(Final_df_log$Last.Revenue.Growth, Final_df_log$Employees_log,
     xlab = "Last Revenue Growth", ylab = "Employees",
     main = "Employees vs. Last Revenue Growth")

# Fit a linear regression line and add it to the plot
fit_last_revenue <- lm(Employees_log ~ Last.Revenue.Growth, data = Final_df_log)
abline(fit_last_revenue, col = "blue")

correlation matrix

correlation_matrix <- cor(Final_df_log[, c("Revenue.in.2022_log", "Employees_log", "Last.Revenue.Growth")])

# Visualize correlation matrix as a heatmap with values inside

# Convert correlation matrix to long format for plotting
correlation_melted <- melt(correlation_matrix)

# Plot heatmap of the correlation matrix with values inside
ggplot(correlation_melted, aes(Var1, Var2, fill = value, label = round(value, 2))) +
  geom_tile(color = "white") +
  geom_text(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Correlation Heatmap with Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_fixed()

There are high correlation between Revenue.in.2022_log vs. Employees_log while the Last.Revenue.Growth vs. Employees_log has low correlation.

Categorical vs. Numerical Correlation

Correlation analysis for Categorical vs. Numerical is not straight forward as Numerical vs. Numerical.Firstly breakdown the categories and find correlation of the breakdowns to further understand how Revenues are contributing to different parts of categories.

# Unique industry categories
unique_categories <- unique(Final_df_log$Industry.Category)

# Create an empty list to store correlations for each category
correlation_list <- list()

#loop
for (category in unique_categories) {
  # Subset data for each category
  subset_data <- subset(Final_df_log, Industry.Category == category)
  
  # Calculate correlation between 'Revenue.in.2022_log' and other variables in the subset
  correlation <- cor.test(subset_data$Revenue.in.2022_log, subset_data$Employees_log)
  
  # Store correlation information in the list
  correlation_list[[category]] <- correlation
}

# Convert the correlation estimates and categories to a data frame
correlation_df <- data.frame(Category = unique_categories,
                             Correlation = sapply(correlation_list, function(x) x$estimate))

# Sort the data frame by Correlation from highest to lowest
correlation_df <- correlation_df[order(-correlation_df$Correlation), ]

# Print correlations from highest to lowest
for (i in 1:nrow(correlation_df)) {
  cat("Correlation for", correlation_df$Category[i], ":", correlation_df$Correlation[i], "\n")
}

## Correlation for Textiles : 0.9720721 
## Correlation for Professional : 0.7741121 
## Correlation for Other : 0.7206766 
## Correlation for Construction : 0.6734928 
## Correlation for Finance & Insurance : 0.6247985 
## Correlation for F&B : 0.5824068 
## Correlation for IT & Computer Services : 0.5681233 
## Correlation for E-commerce : 0.546721 
## Correlation for Retail Trade : 0.5183457 
## Correlation for Telecommunications : 0.5016474 
## Correlation for Administration : 0.371465 
## Correlation for Media : 0.1360945 
## Correlation for Transportation & Storage : -0.09883522 
## Correlation for Electrical components : -0.9999847

# Unique location categories
unique_locations <- unique(Final_df_log$Location.Category)

# Create an empty list to store correlations for each location category
correlation_list_location <- list()

# Loop through each unique location category
for (location in unique_locations) {
  # Subset data for each location category
  subset_data <- subset(Final_df_log, Location.Category == location)
  
  # Calculate correlation between 'Revenue.in.2022_log' and 'Employees_log' in the subset
  correlation <- cor.test(subset_data$Revenue.in.2022_log, subset_data$Employees_log)
  
  # Store correlation information in the list
  correlation_list_location[[location]] <- correlation
}

# Convert the correlation estimates and categories to a data frame for locations
correlation_df_location <- data.frame(Location = unique_locations,
                                      Correlation = sapply(correlation_list_location, function(x) x$estimate))

# Sort the data frame by Correlation from highest to lowest for locations
correlation_df_location <- correlation_df_location[order(-correlation_df_location$Correlation), ]

# Print correlations from highest to lowest for locations
for (i in 1:nrow(correlation_df_location)) {
  cat("Correlation for", correlation_df_location$Location[i], ":", correlation_df_location$Correlation[i], "\n")
}

## Correlation for Middle East : 0.7177035 
## Correlation for North & South America : 0.7168009 
## Correlation for Australia & New Zealand : 0.5966943 
## Correlation for Asia-Pacific : 0.5756662 
## Correlation for Europe : 0.5609685 
## Correlation for Africa : 0.5587881

The best way to visualize this data is to use the bar chart since there is a lot of industry breakdowns

# Bar chart for correlation_df (Industry.Category)
plot_correlation_df <- ggplot(correlation_df, aes(x = reorder(Category, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = round(Correlation, 2)), size = 2, color = "black") +
  labs(title = "Revenue vs Industry Correlation",
       x = "Industry Category", y = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Bar chart for correlation_df_location (Location.Category)
plot_correlation_df_location <- ggplot(correlation_df_location, aes(x = reorder(Location, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  geom_text(aes(label = round(Correlation, 2)),  size = 3, color = "black") +
  labs(title = "Revenue vs Location Correlation",
       x = "Location Category", y = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Arrange plots side by side on the same row
plot_correlation_df + plot_correlation_df_location +
  plot_layout(ncol = 2)

The most have positive correlations with the Revenue.in.2022_log.

Next analyze with Last.Revenue.Growth as the dependent variable and explore the correlations between Industry and Locations.

# Unique industry categories
unique_categories <- unique(Final_df_log$Industry.Category)

# Create an empty list to store correlations for each category
correlation_list_growth <- list()

# Loop through each unique category
for (category in unique_categories) {
  # Subset data for each category
  subset_data <- subset(Final_df_log, Industry.Category == category)
  
  # Calculate correlation between 'Last.Revenue.Growth' and 'Employees_log' in the subset
  correlation <- cor.test(subset_data$Last.Revenue.Growth, subset_data$Employees_log)
  
  # Store correlation information in the list
  correlation_list_growth[[category]] <- correlation
}

# Convert the correlation estimates and categories to a data frame for Last.Revenue.Growth
correlation_df_growth <- data.frame(Category = unique_categories,
                                    Correlation = sapply(correlation_list_growth, function(x) x$estimate))

# Sort the data frame by Correlation from highest to lowest for Last.Revenue.Growth
correlation_df_growth <- correlation_df_growth[order(-correlation_df_growth$Correlation), ]

# Print correlations from highest to lowest for Last.Revenue.Growth
for (i in 1:nrow(correlation_df_growth)) {
  cat("Correlation for", correlation_df_growth$Category[i], ":", correlation_df_growth$Correlation[i], "\n")
}

## Correlation for Electrical components : 0.8818947 
## Correlation for Telecommunications : 0.2229753 
## Correlation for Construction : 0.1277163 
## Correlation for Transportation & Storage : 0.0847529 
## Correlation for Administration : 0.07614184 
## Correlation for Professional : 0.06467406 
## Correlation for IT & Computer Services : -0.00192065 
## Correlation for Retail Trade : -0.01463785 
## Correlation for Other : -0.1103358 
## Correlation for E-commerce : -0.1374407 
## Correlation for Textiles : -0.2457637 
## Correlation for Media : -0.3179541 
## Correlation for F&B : -0.6200289 
## Correlation for Finance & Insurance : -0.6678347

# Unique location categories
unique_locations <- unique(Final_df_log$Location.Category)

# Create an empty list to store correlations for each location category
correlation_list_location_growth <- list()

# Loop through each unique location category
for (location in unique_locations) {
  # Subset data for each location category
  subset_data <- subset(Final_df_log, Location.Category == location)
  
  # Calculate correlation between 'Last.Revenue.Growth' and 'Employees_log' in the subset
  correlation <- cor.test(subset_data$Last.Revenue.Growth, subset_data$Employees_log)
  
  # Store correlation information in the list
  correlation_list_location_growth[[location]] <- correlation
}

# Convert the correlation estimates and categories to a data frame for locations
correlation_df_location_growth <- data.frame(Location = unique_locations,
                                             Correlation = sapply(correlation_list_location_growth, function(x) x$estimate))

# Sort the data frame by Correlation from highest to lowest for locations
correlation_df_location_growth <- correlation_df_location_growth[order(-correlation_df_location_growth$Correlation), ]

# Print correlations from highest to lowest for locations
for (i in 1:nrow(correlation_df_location_growth)) {
  cat("Correlation for", correlation_df_location_growth$Location[i], ":", correlation_df_location_growth$Correlation[i], "\n")
}

## Correlation for Africa : 0.374507 
## Correlation for Middle East : 0.1808134 
## Correlation for Asia-Pacific : 0.04474346 
## Correlation for Europe : -0.164285 
## Correlation for Australia & New Zealand : -0.1943485 
## Correlation for North & South America : -0.3390054

bar charts.

# Bar chart for correlation_df_growth (Industry.Category)
plot_correlation_df_growth <- ggplot(correlation_df_growth, aes(x = reorder(Category, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = round(Correlation, 2)), size = 2, color = "black") +
  labs(title = "Last Revenue Growth vs Industry",
       x = "Industry Category", y = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Bar chart for correlation_df_location_growth (Location.Category)
plot_correlation_df_location_growth <- ggplot(correlation_df_location_growth, aes(x = reorder(Location, Correlation), y = Correlation)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  geom_text(aes(label = round(Correlation, 2)),  size = 3, color = "black") +
  labs(title = "Last Revenue Growth vs Location",
       x = "Location Category", y = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Arrange plots side by side on the same row
plot_correlation_df_growth + plot_correlation_df_location_growth +
  plot_layout(ncol = 2)

There are quite a handful of correlation outcomes that has less than 0 correlation.

Summary

This section focused on the correlations between dependent variables and independent variables and here are few key takeaways that can consider to structure the models:

Revenue.in.2022_log is suggested to be the main dependent variable as it has high correlation to a lot of independent variables comparatively with Last.Revenue.Growth. However, few techniques need to be applied to cater potential skewness challenges. Regression modeling can be the main model for the project, and using Employees as the independent variable can produce a much more straightforward result.

Regression Model

This section addresses Objective 1: exploring the relationship between employee count and revenue. The dependent variable is Revenue.in.2022_log (log-transformed revenue), chosen because it is a continuous numeric measure of financial performance. Employees_log is used as the sole independent variable to keep the relationship interpretable. A linear regression is appropriate here since both variables are numeric and we are estimating the magnitude of the relationship.

Split Data

To split the data into training sets and test sets.

#Load necessary packages
# Split the data into training and testing sets
set.seed(123)
# Create a data partition for training and testing
train_index <- createDataPartition(Final_df_log$Revenue.in.2022_log, p = 0.8, list = FALSE)

# Extract the training and testing data
train_data <- Final_df_log[train_index, ]
test_data <- Final_df_log[-train_index, ]

Data Modeling

# Create a simple linear regression model with "Employees_log" as the only independent variable
model <- lm(Revenue.in.2022_log ~ Employees_log, data = train_data)

# Summarize the linear regression model
model_summary <- summary(model)

# Print the model summary
print(model_summary)

## 
## Call:
## lm(formula = Revenue.in.2022_log ~ Employees_log, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6931 -0.6510 -0.2719  0.4414  5.1399 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.06769    0.14392   -0.47    0.638    
## Employees_log  0.53763    0.03956   13.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9576 on 477 degrees of freedom
## Multiple R-squared:  0.2791, Adjusted R-squared:  0.2776 
## F-statistic: 184.7 on 1 and 477 DF,  p-value: < 2.2e-16

# Visualize the relationship between Employees_log and Revenue.in.2022_log
plot(train_data$Employees_log, train_data$Revenue.in.2022_log, main = "Scatterplot of Employees_log vs. Revenue.in.2022_log", 
     xlab = "Employees_log", ylab = "Revenue.in.2022_log", col = "blue", pch = 16)

# Add the regression line to the scatterplot
abline(model, col = "red")

### Test Result

# Predictions on the test set
predictions <- predict(model, newdata = test_data)

# Model Evaluation on the test set
# Calculate Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)
rmse <- sqrt(mean((predictions - test_data$Revenue.in.2022_log)^2))
mae <- mean(abs(predictions - test_data$Revenue.in.2022_log))

# Print evaluation metrics
cat("Root Mean Squared Error (RMSE):", rmse, "\n")

## Root Mean Squared Error (RMSE): 0.7910697

cat("Mean Absolute Error (MAE):", mae, "\n")

## Mean Absolute Error (MAE): 0.6073597

# Additional Model Evaluation Metrics
# Calculate R-squared
rsquared <- summary(model)$r.squared
cat("R-squared:", rsquared, "\n")

## R-squared: 0.2790829

# Calculate Adjusted R-squared
adj_rsquared <- summary(model)$adj.r.squared
cat("Adjusted R-squared:", adj_rsquared, "\n")

## Adjusted R-squared: 0.2775716

Analyze Results

# Residual Analysis: Plot residuals vs. fitted values
residuals <- residuals(model)
fitted_values <- fitted(model)
plot(fitted_values, residuals, main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals", col = "purple", pch = 16)

# Perform Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(residuals)
cat("Shapiro-Wilk test p-value:", shapiro_test$p.value, "\n")

## Shapiro-Wilk test p-value: 5.673278e-17

Classification Model

This section addresses Objective 2: predicting whether a company will achieve positive revenue growth in 2023. Unlike the regression above, the target here is Last.Revenue.Growth converted into a binary outcome — 1 if growth was positive, 0 otherwise. Classification is used instead of regression because the business question is a yes/no decision, not a precise growth estimate. The same Employees_log predictor is used to maintain consistency with the regression analysis.

Data Preparation

Ensure the dataset is clean and does not have missing values.

head(Final_df_log, 5)

##   Revenue.in.2022_log Employees_log Last.Revenue.Growth     Industry.Category
## 1            6.109248      5.099866                0.05                   F&B
## 2            6.035481      1.791759               -2.72 Electrical components
## 3            6.001415      1.791759                0.03 Electrical components
## 4            5.924256      6.715383                0.41          Retail Trade
## 6            5.010635      5.686975              -10.61                 Other
##       Location.Category
## 1 North & South America
## 2                Europe
## 3                Europe
## 4          Asia-Pacific
## 6                Europe

summary(Final_df_log)

##  Revenue.in.2022_log Employees_log   Last.Revenue.Growth Industry.Category 
##  Min.   :0.6931      Min.   :0.000   Min.   :-16.440     Length:595        
##  1st Qu.:1.0986      1st Qu.:2.833   1st Qu.: -4.405     Class :character  
##  Median :1.3863      Median :3.258   Median :  0.780     Mode  :character  
##  Mean   :1.7724      Mean   :3.448   Mean   :  1.181                       
##  3rd Qu.:2.3979      3rd Qu.:4.052   3rd Qu.:  6.545                       
##  Max.   :6.1092      Max.   :7.179   Max.   : 21.880                       
##  Location.Category 
##  Length:595        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

# check for duplicate rows
duplicaterow <- sum(duplicated(df))
cat("duplicate data :", duplicaterow)

## duplicate data : 0

#check for missing data
colSums(is.na(Final_df_log))

## Revenue.in.2022_log       Employees_log Last.Revenue.Growth   Industry.Category 
##                   0                   0                   0                   0 
##   Location.Category 
##                   0

Encode categorical variables (Last.Revenue.Growth)

This is to set the dependent variable as the binary outcome (positive revenue growth or not), and the independent variables as the numbers of employee.

# Encode categorical variable 
Final_df_log$Encoded.Last.Revenue.Growth <- ifelse(Final_df_log$Last.Revenue.Growth > 0, 1, 0)
# Create factor version here — before the split so all datasets inherit it
Final_df_log$Factor.Last.Revenue.Growth <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)
# Print the updated dataset
head(Final_df_log,5)

##   Revenue.in.2022_log Employees_log Last.Revenue.Growth     Industry.Category
## 1            6.109248      5.099866                0.05                   F&B
## 2            6.035481      1.791759               -2.72 Electrical components
## 3            6.001415      1.791759                0.03 Electrical components
## 4            5.924256      6.715383                0.41          Retail Trade
## 6            5.010635      5.686975              -10.61                 Other
##       Location.Category Encoded.Last.Revenue.Growth Factor.Last.Revenue.Growth
## 1 North & South America                           1                          1
## 2                Europe                           0                          0
## 3                Europe                           1                          1
## 4          Asia-Pacific                           1                          1
## 6                Europe                           0                          0

Splitting Data

Split the dataset into training and testing sets.

#Count numbers of rows
number_rows <- nrow(Final_df_log)
# Print the number of rows
print(paste("Number of rows:", number_rows))

## [1] "Number of rows: 595"

# Set the seed for reproducibility
set.seed(123)
# Specify the split ratio
split <- 0.80
# Sample for the training set
train_data <- sample(seq_len(nrow(Final_df_log)), split * nrow(Final_df_log))
# Create the training set
training_set <- Final_df_log[train_data, ]
# Create the testing set by excluding the training set indices
testing_set <- Final_df_log[-train_data, ]
# Print the training and testing sets
print(paste("Training set size:", nrow(training_set)))

## [1] "Training set size: 476"

print(paste("Testing set size:", nrow(testing_set)))

## [1] "Testing set size: 119"

print(paste("Training set size:", nrow(training_set)))

## [1] "Training set size: 476"

print(paste("Testing set size:",  nrow(testing_set)))

## [1] "Testing set size: 119"

# Confirm the factor column exists in both sets
cat("Factor column in training set:", 
    "Factor.Last.Revenue.Growth" %in% colnames(training_set), "\n")

## Factor column in training set: TRUE

cat("Factor column in testing set:",  
    "Factor.Last.Revenue.Growth" %in% colnames(testing_set),  "\n")

## Factor column in testing set: TRUE

# Check class balance
cat("Class distribution in training set:\n")

## Class distribution in training set:

print(table(training_set$Factor.Last.Revenue.Growth))

## 
##   0   1 
## 212 264

Logistic Regression Model

# Perform logistic regression
logistic_model <- glm(Encoded.Last.Revenue.Growth ~ Employees_log, data = training_set, family = "binomial")
# Print the summary of the logistic regression model
summary(logistic_model)

## 
## Call:
## glm(formula = Encoded.Last.Revenue.Growth ~ Employees_log, family = "binomial", 
##     data = training_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.62140    0.30536   2.035   0.0419 *
## Employees_log -0.11649    0.08416  -1.384   0.1663  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 654.18  on 475  degrees of freedom
## Residual deviance: 652.26  on 474  degrees of freedom
## AIC: 656.26
## 
## Number of Fisher Scoring iterations: 4

Model Evaluation:

Evaluate the model’s performance on the testing dataset using appropriate metrics such as accuracy, precision, recall, and F1 score.

# Predict on the testing set
predicted_probs <- predict(logistic_model, newdata = testing_set, type = "response")
predicted_Revenue.growth <- ifelse(predicted_probs > 0.5, 1, 0)
# Evaluate the model
conf_matrix <- table(
  Predicted = predicted_Revenue.growth,
  Actual    = testing_set$Encoded.Last.Revenue.Growth
)
print(conf_matrix)

##          Actual
## Predicted  0  1
##         0  2  5
##         1 54 58

Interpretation

Results:

1.(True Positives) The model correctly predicted positive revenue growth in 58 instances .

2.(False Positive) The model incorrectly predicted positive revenue growth in 5 instance where there was no actual positive revenue growth .

3.(False Negative) The model failed to predict positive revenue growth in 54 instances where there was actual positive revenue growth .

4.(True Negative) The model correctly predicted no positive revenue growth in 2 instance .

# Confusion matrix values
TP <- 58
FP <- 5
FN <- 54
TN <- 2

# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print the results
print(paste("Accuracy:", round(accuracy, 4)))

## [1] "Accuracy: 0.5042"

print(paste("Precision:", round(precision, 4)))

## [1] "Precision: 0.9206"

print(paste("Recall:", round(recall, 4)))

## [1] "Recall: 0.5179"

print(paste("Specificity:", round(specificity, 4)))

## [1] "Specificity: 0.2857"

print(paste("F1 Score:", round(f1_score, 4)))

## [1] "F1 Score: 0.6629"

5-fold cross-validation for more reliable accuracy estimate

cv_control <- trainControl(method = "cv", number = 5, classProbs = FALSE)

Final_df_log$CV.Target <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)

cv_model <- train(
  CV.Target ~ Employees_log,
  data      = Final_df_log,
  method    = "glm",
  family    = "binomial",
  trControl = cv_control
)

print(cv_model)

## Generalized Linear Model 
## 
## 595 samples
##   1 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 476, 475, 476, 476, 477 
## Resampling results:
## 
##   Accuracy   Kappa       
##   0.5428586  -0.009902522

cat("Cross-validated Accuracy:", round(max(cv_model$results$Accuracy), 4), "\n")

## Cross-validated Accuracy: 0.5429

Random Forest

# Train the Random Forest model
rf_model <- randomForest::randomForest(Factor.Last.Revenue.Growth ~  Employees_log, data = training_set)
# Make predictions on the testing set
predicted_Revenue.growth <- predict(rf_model, newdata = testing_set, type = "response")

Model Evaluation

# Evaluate the model (e.g., confusion matrix, accuracy, etc.)
conf_matrix <- table(predicted_Revenue.growth, testing_set$Factor.Last.Revenue.Growth)
# Create a new confusion matrix with updated labels
rf_conf_matrix <- conf_matrix
# Rename rows (predicted labels)
rownames(rf_conf_matrix) <- c("Actual 0", "Actual 1")
# Rename columns (true labels)
colnames(rf_conf_matrix) <- c("Predicted 0", "Predicted 1")
# Print the updated confusion matrix
print(rf_conf_matrix)

##                         
## predicted_Revenue.growth Predicted 0 Predicted 1
##                 Actual 0          27          27
##                 Actual 1          29          36

5-fold cross-validation for more reliable accuracy estimate

cv_control <- trainControl(method = "cv", number = 5, classProbs = FALSE)

Final_df_log$CV.Target <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)

cv_model <- train(
  CV.Target ~ Employees_log,
  data      = Final_df_log,
  method    = "rf",
  family    = "binomial",
  trControl = cv_control
)

print(cv_model)

## Random Forest 
## 
## 595 samples
##   1 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 477, 476, 475, 477, 475 
## Resampling results:
## 
##   Accuracy   Kappa     
##   0.5345544  0.05979448
## 
## Tuning parameter 'mtry' was held constant at a value of 2

cat("Cross-validated Accuracy:", round(max(cv_model$results$Accuracy), 4), "\n")

## Cross-validated Accuracy: 0.5346

Interpretation

Results:

1.(True Positives) The model correctly predicted positive revenue growth in 37 instances .

2.(False Positive) The model incorrectly predicted positive revenue growth in 26 instance where there was no actual positive revenue growth .

3.(False Negative) The model failed to predict positive revenue growth in 30 instances where there was actual positive revenue growth .

4.(True Negative) The model correctly predicted no positive revenue growth in 26 instance .

# Confusion matrix values
TP <- 36
FP <- 27
FN <- 29
TN <- 27

# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print the results
print(paste("Accuracy:", round(accuracy, 4)))

## [1] "Accuracy: 0.5294"

print(paste("Precision:", round(precision, 4)))

## [1] "Precision: 0.5714"

print(paste("Recall:", round(recall, 4)))

## [1] "Recall: 0.5538"

print(paste("Specificity:", round(specificity, 4)))

## [1] "Specificity: 0.5"

print(paste("F1 Score:", round(f1_score, 4)))

## [1] "F1 Score: 0.5625"

Naive Bayes model

# Train a Naive Bayes model
naive_bayes_model <- naiveBayes(Encoded.Last.Revenue.Growth ~ Employees_log, data = training_set)
# Make predictions on the testing set
predicted_probs <- predict(naive_bayes_model, testing_set, type = "raw")
# Convert predicted probabilities to labels 
predicted_Revenue.growth <- predict(naive_bayes_model, testing_set, type = "class")

Model Evaluation

# Create a confusion matrix
conf_matrix <- table(predicted_Revenue.growth,testing_set$Encoded.Last.Revenue.Growth)
# Create a new confusion matrix with updated labels
nb_conf_matrix <- conf_matrix
# Rename rows (predicted labels)
rownames(nb_conf_matrix) <- c("Actual 0", "Actual 1")
# Rename columns (true labels)
colnames(nb_conf_matrix) <- c("Predicted 0", "Predicted 1")
# Print the updated confusion matrix
print(nb_conf_matrix)

##                         
## predicted_Revenue.growth Predicted 0 Predicted 1
##                 Actual 0           0           0
##                 Actual 1          56          63

Results:

1.(True Positives) The model correctly predicted positive revenue growth in 56 instances .

2.(False Positive) The model incorrectly predicted positive revenue growth in 0 instance where there was no actual positive revenue growth .

3.(False Negative) The model failed to predict positive revenue growth in 63 instances where there was actual positive revenue growth .

4.(True Negative) The model correctly predicted no positive revenue growth in 0 instance .

# Confusion matrix values
TP <- 56
FP <- 0
FN <- 63
TN <- 0

# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)

# Print the results
print(paste("Accuracy:", round(accuracy, 4)))

## [1] "Accuracy: 0.4706"

print(paste("Precision:", round(precision, 4)))

## [1] "Precision: 1"

print(paste("Recall:", round(recall, 4)))

## [1] "Recall: 0.4706"

print(paste("Specificity:", round(specificity, 4)))

## [1] "Specificity: NaN"

print(paste("F1 Score:", round(f1_score, 4)))

## [1] "F1 Score: 0.64"

Conclusion

Several classification model is trained (Logistic Regression, Random Forest, Naives Bayes) to predict the likelihood of positive revenue growth in 2023. The accuracy of each model are 0.5042, 0.5294, 0.4706 respectively.

In conclusion,the regression model confirms a statistically significant positive relationship between employee count and revenue (R² = 0.28, RMSE = 0.79). However, R² = 0.28 means employee count explains only 28% of the variation in revenue — the remaining 72% is driven by factors not captured in this model. Employee count alone is therefore a weak practical predictor, and future work should incorporate additional features such as industry, location, and company age.

The 2 objectives are answered even the model result are not producing satisfactory result. High predictive accuracy is not feasible mainly due to the data limitation. Even though both model doesn’t produce satisfactory results but it’s important to conduct a thorough evaluation and analysis to understand the limitations and potential areas for improvement.

Action for further analysis:

Re-evaluate Data Quality: Data cleaning and pre-processing
Feature Engineering: select the correct feature based on the correlation matrix.

Company Revenue Growth Analysis & Prediction

To explore and analyze the relationship between key factors affecting a company’s revenue and its potential for achieving positive revenue growth in the upcoming year.

Introduction

Data Obtain

Data Cleaning

Exploratory Data Analytics (EDA)

Understanding Dataset

Understanding Numeric Variables

Understanding Categorical Variables

Analyzing Data

Numerical vs. Numerical Correlation

Categorical vs. Numerical Correlation

Summary

Regression Model

Split Data

Data Modeling

Analyze Results

Classification Model

Data Preparation

Splitting Data

Logistic Regression Model

Model Evaluation:

Interpretation

5-fold cross-validation for more reliable accuracy estimate

Random Forest

Model Evaluation

5-fold cross-validation for more reliable accuracy estimate

Interpretation

Naive Bayes model

Model Evaluation

Conclusion