The dataset contains information about 600 e-commerce companies, including their revenue in 2022, employee count, past revenue growth, industry category, and location category.
The objectives of this project are:
To explore the relationship between a company’s employee count and revenue in 2022
To build a predictive model that can assess the likelihood of a company achieving positive revenue growth in the year 2023.
The dataset is obtained from the source of https://www.statista.com/companies/search?queries%5B%5D=ecommerce.
The basic information such as column names, total row numbers, total column numbers, data types and etc. are identified as followed by steps belows:
# Set CRAN mirror
setwd("D:/MDS project/WQD7004 PROGRAMMING FOR DATA SCIENCE")
df <- read.csv("Revenue Prediction Dataset.csv")
#check data information
head(df,5)
## Name Location City State
## 1 Ajso Global Ecommerce LLC United States Anchorage Alaska
## 2 Lcc Best Ecommerce Ltd. United Kingdom Cardiff South Glamorgan
## 3 Astraza Ecommerce Ltd. United Kingdom Hayes Middlesex
## 4 Shanghai Kaijie Ecommerce Co., Ltd. China Shanghai Shanghai
## 5 Grays Ecommerce Group Limited Australia Moorebank New South Wales
## Industry Revenue.in.2022
## 1 Tobacco $450.9m USD
## 2 Electronic & Telecommunication Equipment & Parts $418.3m USD
## 3 Electronic & Telecommunication Equipment & Parts $404.3m USD
## 4 Retail Trade, Except Motor Vehicles $374.2m USD
## 5 E-Commerce $162.8m USD
## Last.Revenue.Growth ISO.Code Employees IPO.Status Operating.Status
## 1 0.05% USD 164 Private Active
## 2 -2.72% USD 6 Private Active
## 3 0.03% USD 6 Private Active
## 4 0.41% USD 825 Private Active
## 5 1966.76% USD 652 Private Active
## Founding.Year...Est.
## 1 n/a
## 2 2018
## 3 2018
## 4 2010
## 5 1992
print(paste("Number of records: ", nrow(df)))
## [1] "Number of records: 600"
print(paste("Number of features: ", ncol(df)))
## [1] "Number of features: 12"
colnames(df)
## [1] "Name" "Location" "City"
## [4] "State" "Industry" "Revenue.in.2022"
## [7] "Last.Revenue.Growth" "ISO.Code" "Employees"
## [10] "IPO.Status" "Operating.Status" "Founding.Year...Est."
summary(df)
## Name Location City State
## Length:600 Length:600 Length:600 Length:600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Industry Revenue.in.2022 Last.Revenue.Growth ISO.Code
## Length:600 Length:600 Length:600 Length:600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Employees IPO.Status Operating.Status Founding.Year...Est.
## Length:600 Length:600 Length:600 Length:600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
str(df)
## 'data.frame': 600 obs. of 12 variables:
## $ Name : chr "Ajso Global Ecommerce LLC" "Lcc Best Ecommerce Ltd." "Astraza Ecommerce Ltd." "Shanghai Kaijie Ecommerce Co., Ltd." ...
## $ Location : chr "United States" "United Kingdom" "United Kingdom" "China" ...
## $ City : chr "Anchorage" "Cardiff" "Hayes" "Shanghai" ...
## $ State : chr "Alaska" "South Glamorgan" "Middlesex" "Shanghai" ...
## $ Industry : chr "Tobacco" "Electronic & Telecommunication Equipment & Parts" "Electronic & Telecommunication Equipment & Parts" "Retail Trade, Except Motor Vehicles" ...
## $ Revenue.in.2022 : chr "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
## $ Last.Revenue.Growth : chr "0.05%" "-2.72%" "0.03%" "0.41%" ...
## $ ISO.Code : chr "USD" "USD" "USD" "USD" ...
## $ Employees : chr "164" "6" "6" "825" ...
## $ IPO.Status : chr "Private" "Private" "Private" "Private" ...
## $ Operating.Status : chr "Active" "Active" "Active" "Active" ...
## $ Founding.Year...Est.: chr "n/a" "2018" "2018" "2010" ...
#Remove unnecessary column
df1 <- df[,c("Location", "Industry", "Revenue.in.2022","Employees","Last.Revenue.Growth","Founding.Year...Est.")]
head(df1,5)
## Location Industry
## 1 United States Tobacco
## 2 United Kingdom Electronic & Telecommunication Equipment & Parts
## 3 United Kingdom Electronic & Telecommunication Equipment & Parts
## 4 China Retail Trade, Except Motor Vehicles
## 5 Australia E-Commerce
## Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## 1 $450.9m USD 164 0.05% n/a
## 2 $418.3m USD 6 -2.72% 2018
## 3 $404.3m USD 6 0.03% 2018
## 4 $374.2m USD 825 0.41% 2010
## 5 $162.8m USD 652 1966.76% 1992
df1[df1 == "n/a"] <- NA #change n/a to NA (standard representation of missing value)
colSums(is.na(df1))
## Location Industry Revenue.in.2022
## 0 51 0
## Employees Last.Revenue.Growth Founding.Year...Est.
## 1 1 256
#Cleaning for categorical data
#for industry column
df1$Industry.Category <- "Other" # Default
# Consolidate based on keywords
industry_hierarchy <- list(
"IT & Computer Services" = c("Computer Programming", "Computer Programming, Consultancy", "Other IT & Computer Services"),
"Telecommunications" = c("Telecommunications", "Other Telecommunications"),
"Finance & Insurance" = c("Finance-Related Services, Except Insurance & Pension Funding", "Finance & Insurance", "Finance & Insurance-Related Activities", "Reinsurance", "Trusts, Funds & Similar Financial Entities","Insurance & Pension-Related Services"),
"Retail Trade" = c("Wholesale Trade, Except Motor Vehicles","Dairy Products","Non-Specialized Wholesale Trade","Retail Trade, Except Motor Vehicles", "Retail Sales via Stalls & Markets", "Other Non-Specialized Stores", "Other Specialized Wholesale Activities","Wholesale & Retail Trade, Including Motor Vehicle Repairs" ),
"Electrical components" = c("Electronic Components & Boards","Electronic & Telecommunication Equipment & Parts"),
"E-commerce" = "E-Commerce",
"Media" = c("Information & Communication","Clothing, Footwear & Leather Articles","Electrical Household Appliances, Furniture, Lighting Equipment & Other Articles","Advertising","Radio Broadcasting","Books, Newspapers & Stationery"),
"Textiles" = c("Textiles, Clothing & Footwear","Textiles","Other Textiles"),
"Construction" = c("Real Estate","Construction") ,
"F&B" = c("Food, Beverages & Tobacco Wholesales","Food","Soft Drinks & Water","Malt Liquors & Malt","Other Food Services","Tobacco"),
"Transportation & Storage" = c("Postal Services","Transportation & Storage", "Other Transportation Support Activities", "Warehousing & Storage","Courier Services"),
"Administration" = c("Computer Consultancy & Facilities Management","Administrative & Supportive Services","Management Consultancy"),
"Professional" = c("Electrical Household Appliances, Furniture, Lighting Equipment & Other Articles","Travel Agencies","Professional, Scientific & Technical Services","Education","Research & Experimental Development on Natural Sciences & Engineering","Pharmaceutical or Medical Goods, Cosmetic & Toilet Articles") ,
"Other" = "Other" # Default category for unmatched values
)
# Loop through hierarchy and apply consolidation
for (parent_category in names(industry_hierarchy)) {
subcategories <- industry_hierarchy[[parent_category]]
df1$Industry.Category[grepl(paste(subcategories, collapse = "|"), df1$Industry, ignore.case = TRUE)] <- parent_category
}
industry_counts <- table(df1$Industry.Category)
industry_counts
##
## Administration Construction E-commerce
## 59 7 64
## Electrical components F&B Finance & Insurance
## 3 11 13
## IT & Computer Services Media Other
## 70 16 164
## Professional Retail Trade Telecommunications
## 7 154 16
## Textiles Transportation & Storage
## 6 10
#For location column
table(df1$Location)
##
## Australia Austria Bahrain
## 5 1 1
## Belgium Brazil Canada
## 1 1 7
## China Denmark Egypt
## 98 1 1
## Finland France Germany
## 2 1 80
## Hong Kong SAR China India Indonesia
## 7 228 4
## Ireland Israel Italy
## 2 1 8
## Jordan Lebanon Libya
## 3 1 1
## Malaysia Mauritius Morocco
## 8 1 2
## Netherlands New Zealand Nigeria
## 2 2 1
## Norway Philippines Poland
## 1 1 2
## Qatar Russia Saudi Arabia
## 2 1 2
## Singapore South Africa South Korea
## 3 4 1
## Spain Sweden Taiwan
## 5 1 1
## Tanzania Thailand United Arab Emirates
## 1 4 3
## United Kingdom United States Uzbekistan
## 43 46 1
## Vietnam
## 8
df1$location.Category <- "Other" # Default
location_hierarchy <- list(
"Australia & New Zealand" = c("Australia", "New Zealand"),
"Europe" = c("Austria", "Belgium", "Denmark", "Finland", "France", "Germany", "Ireland","United Kingdom",
"Italy", "Netherlands", "Norway", "Poland", "Spain", "Sweden","Russia") ,
"Middle East" = c("Bahrain", "Jordan", "Lebanon", "Qatar", "Saudi Arabia", "United Arab Emirates"),
"North & South America" = c("Brazil", "Canada", "United States"),
"Asia-Pacific" =c("China", "Hong Kong SAR China", "India", "Indonesia", "Israel", "Japan",
"Malaysia", "Philippines", "Singapore", "South Korea", "Taiwan", "Thailand",
"Vietnam","Uzbekistan"),
"Africa" = c("Egypt", "Libya", "Mauritius", "Morocco", "Nigeria", "South Africa", "Tanzania"),
"Other" = "Other"
)
# Loop through hierarchy and apply consolidation
for (country in names(location_hierarchy)) {
subcategories <- location_hierarchy[[country]]
df1$Location.Category[grepl(paste(subcategories, collapse = "|"), df1$Location, ignore.case = TRUE)] <- country
}
location_counts <- table(df1$Location.Category)
location_counts
##
## Africa Asia-Pacific Australia & New Zealand
## 11 365 7
## Europe Middle East North & South America
## 151 12 54
After cleaning the categorical data, the numeric data will be cleaned by checking the impact of the variable itself using correlation matrix or based on the percentage of missing value as compared to overall dataset.
#Select only the numeric column for cleaning
Maindf <- df1[,c("Revenue.in.2022","Employees","Last.Revenue.Growth","Founding.Year...Est.")]
#Re-check the data information
print(paste("Number of records: ", nrow(Maindf)))
## [1] "Number of records: 600"
print(paste("Number of features: ", ncol(Maindf)))
## [1] "Number of features: 4"
colnames(Maindf)
## [1] "Revenue.in.2022" "Employees" "Last.Revenue.Growth"
## [4] "Founding.Year...Est."
summary(Maindf)
## Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## Length:600 Length:600 Length:600 Length:600
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
str(Maindf)
## 'data.frame': 600 obs. of 4 variables:
## $ Revenue.in.2022 : chr "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
## $ Employees : chr "164" "6" "6" "825" ...
## $ Last.Revenue.Growth : chr "0.05%" "-2.72%" "0.03%" "0.41%" ...
## $ Founding.Year...Est.: chr NA "2018" "2018" "2010" ...
# Count the number of duplicate rows
num_duplicates <- sum(duplicated(df))
print(num_duplicates)
## [1] 0
#check missing data
any_missing <- any(Maindf == "n/a", na.rm = TRUE)
Maindf[Maindf == "n/a"] <- NA #change n/a to NA (standard representation of missing value)
colSums(is.na(Maindf)) #count number of NA per column
## Revenue.in.2022 Employees Last.Revenue.Growth
## 0 1 1
## Founding.Year...Est.
## 256
colSums(!is.na(Maindf)) #check number of filled values per column
## Revenue.in.2022 Employees Last.Revenue.Growth
## 600 599 599
## Founding.Year...Est.
## 344
is.null(Maindf) #check nothingness
## [1] FALSE
Since there are only 1 rows of Employees and 1 rows of Last.Revenue.Growth are having missing values, which is less than 10% of the overall dataset, both rows will be removed.
#remove missing column of employees and revenue
clean_Employees <- Maindf[complete.cases(Maindf$Employees), ]
print(sum(is.na(clean_Employees$Employees)))
## [1] 0
print(sum(!is.na(clean_Employees$Employees)))
## [1] 599
clean_Revenue <- clean_Employees[complete.cases(clean_Employees$Last.Revenue.Growth), ]
clean_Maindf<- clean_Revenue
print(sum(is.na(clean_Maindf$Last.Revenue.Growth)))
## [1] 0
print(sum(!is.na(clean_Maindf$Last.Revenue.Growth)))
## [1] 598
summary(clean_Maindf)
## Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## Length:598 Length:598 Length:598 Length:598
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
colSums(is.na(clean_Maindf))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 0 0 0
## Founding.Year...Est.
## 256
colSums(!is.na(clean_Maindf))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 598 598 598
## Founding.Year...Est.
## 342
There are 256/600 missing values for Founding.Year…Est. Hence, the data frame is converted from character to numeric in order to perform the impact assessment using correlation matrix. Meanwhile, all “NA” rows under Founding.Year…Est. will be temporary eliminated during correlation matrix calculation.
#Before calculation, need to convert all from character to numeric
unique(clean_Maindf$Employees) #to check why unable to direct convert
## [1] "164" "6" "825" "652" "295" "296" "290" "376" "292"
## [10] "291" "293" "299" "289" "294" "391" "298" "297" "22"
## [19] "196" "135" "369" "467" "46" "891" "261" "924" "16"
## [28] "97" "112" "94" "313" "184" "110" "98" "27" "56"
## [37] "14" "29" "99" "510" "60" "109" "23" "89" "26"
## [46] "36" "1" "964" "146" "88" "63" "31" "21" "191"
## [55] "337" "301" "20" "19" "13" "12" "310" "216" "58"
## [64] "148" "86" "70" "5" "50" "49" "134" "118" "121"
## [73] "136" "87" "199" "214" "41" "47" "55" "40" "193"
## [82] "44" "24" "11" "236" "17" "157" "3" "96" "176"
## [91] "38" "188" "79" "69" "66" "67" "42" "30" "25"
## [100] "65" "150" "18" "151" "2" "77" "53" "155" "4"
## [109] "28" "10" "9" "271" "75" "37" "34" "33" "7"
## [118] "35" "113" "102" "82" "32" "81" "115" "103" "78"
## [127] "51" "95" "83" "54" "325" "15" "39" "93" "92"
## [136] "541" "48" "8" "43" "64" "1,311" "59" "57" "71"
sapply(clean_Maindf, class)
## Revenue.in.2022 Employees Last.Revenue.Growth
## "character" "character" "character"
## Founding.Year...Est.
## "character"
str(clean_Maindf)
## 'data.frame': 598 obs. of 4 variables:
## $ Revenue.in.2022 : chr "$450.9m USD" "$418.3m USD" "$404.3m USD" "$374.2m USD" ...
## $ Employees : chr "164" "6" "6" "825" ...
## $ Last.Revenue.Growth : chr "0.05%" "-2.72%" "0.03%" "0.41%" ...
## $ Founding.Year...Est.: chr NA "2018" "2018" "2010" ...
# Revenue in 2022: Remove "$" and "m USD" and convert to numeric
clean_Maindf$Revenue.in.2022 <- as.numeric(substr(clean_Maindf$Revenue.in.2022, 2, nchar(clean_Maindf$Revenue.in.2022) - 6))
# Last Revenue Growth: Remove "%"
clean_Maindf$Last.Revenue.Growth <- as.numeric(substr(clean_Maindf$Last.Revenue.Growth, 1, nchar(clean_Maindf$Last.Revenue.Growth) - 1))
# Other: normal convert
clean_Maindf$Founding.Year...Est. <- as.numeric(clean_Maindf$Founding.Year...Est.)
clean_Maindf$Employees <- as.numeric(gsub(",", "", clean_Maindf$Employees))
clean_Maindf$Employees <- as.numeric(clean_Maindf$Employees)
colSums(is.na(clean_Maindf))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 0 0 0
## Founding.Year...Est.
## 256
colSums(!is.na(clean_Maindf))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 598 598 598
## Founding.Year...Est.
## 342
#For Funding Years, to check whether the variable is impact on revenue, correlation matrix is calculated after eliminate all NA column.
#Clean NA rows for Founding year
clean_Founding <- clean_Maindf[complete.cases(clean_Maindf$Founding.Year...Est.), ]
print(sum(is.na(clean_Founding$Founding.Year...Est.)))
## [1] 0
print(sum(!is.na(clean_Founding$Founding.Year...Est.)))
## [1] 342
colSums(is.na(clean_Founding))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 0 0 0
## Founding.Year...Est.
## 0
colSums(!is.na(clean_Founding))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 342 342 342
## Founding.Year...Est.
## 342
head(clean_Founding,5)
## Revenue.in.2022 Employees Last.Revenue.Growth Founding.Year...Est.
## 2 418 6 -2.72 2018
## 3 404 6 0.03 2018
## 4 374 825 0.41 2010
## 5 162 652 1966.76 1992
## 6 150 295 -10.61 2019
# Correlation Matrix Calculation
cor_matrix <- cor(clean_Founding)
print(cor_matrix)
## Revenue.in.2022 Employees Last.Revenue.Growth
## Revenue.in.2022 1.0000000 0.41195643 0.1636159
## Employees 0.4119564 1.00000000 0.1997368
## Last.Revenue.Growth 0.1636159 0.19973677 1.0000000
## Founding.Year...Est. 0.0363452 -0.07226679 -0.1898559
## Founding.Year...Est.
## Revenue.in.2022 0.03634520
## Employees -0.07226679
## Last.Revenue.Growth -0.18985586
## Founding.Year...Est. 1.00000000
Based on Correlation Matrix, Founding year have no correlation with any of the other value. Hence, by considering it consists of high number of missing value, the entire column is decided to be removed and new data frame is created.
#New data frame after remove unnecessary columns.
Final_df <- clean_Maindf[,c("Revenue.in.2022","Employees","Last.Revenue.Growth")]
head(Final_df,5)
## Revenue.in.2022 Employees Last.Revenue.Growth
## 1 450 164 0.05
## 2 418 6 -2.72
## 3 404 6 0.03
## 4 374 825 0.41
## 5 162 652 1966.76
sapply(Final_df, class)
## Revenue.in.2022 Employees Last.Revenue.Growth
## "numeric" "numeric" "numeric"
summary(Final_df)
## Revenue.in.2022 Employees Last.Revenue.Growth
## Min. : 2.00 Min. : 1.00 Min. : -50.280
## 1st Qu.: 3.00 1st Qu.: 17.00 1st Qu.: -4.407
## Median : 4.00 Median : 26.00 Median : 0.785
## Mean : 15.38 Mean : 64.91 Mean : 4.736
## 3rd Qu.: 11.00 3rd Qu.: 58.00 3rd Qu.: 6.570
## Max. :450.00 Max. :1311.00 Max. :1966.760
colSums(is.na(Final_df))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 0 0 0
colSums(!is.na(Final_df))
## Revenue.in.2022 Employees Last.Revenue.Growth
## 598 598 598
Effectiveness of Data Cleaning will be check using Histogram for skewness and BOX-plot for outlier. For skewness, due to both columns of Employees and Revenue.in.2022 shows highly positive skewness towards right side, the log-transformation method have been used to transform the data towards more symmetric result. By comparing the result of original data frame and log-transformed data frame,the similar results shows least impact of log-transform to correlation matrix.Hence, the transformation is acceptable and will not impact on prediction results.
# Check data types
class(Final_df$Employees)
## [1] "numeric"
# Convert to numeric
#your_data$variable <- as.numeric(your_data$variable)
# Install and load the moments package if not already installed
# Raw data skewness
skew_bf_Emp <- skewness(Final_df$Employees)
skew_bf_Rev <- skewness(Final_df$Revenue.in.2022)
print(paste("Original Skewness of Employees:",skew_bf_Emp))
## [1] "Original Skewness of Employees: 5.22762327599016"
print(paste("Original Skewness of Revenue in 2022:",skew_bf_Rev))
## [1] "Original Skewness of Revenue in 2022: 6.77879926639424"
# Log transformation
Final_df$Employees_log <- log(Final_df$Employees)
Final_df$Revenue.in.2022_log <- log(Final_df$Revenue.in.2022)
# Check the skewness after transformation
skew_af_Emp <- skewness(Final_df$Employees_log)
skew_af_Rev <- skewness(Final_df$Revenue.in.2022_log)
print(paste("Log-Transformed Skewness of Employees:",skew_af_Emp))
## [1] "Log-Transformed Skewness of Employees: 0.338800603770846"
print(paste("Log-Transformed Skewness of Revenue in 2022:",skew_af_Rev))
## [1] "Log-Transformed Skewness of Revenue in 2022: 1.45478092705567"
# Create histograms for both the original and log-transformed variables
par(mfrow=c(1,2)) # Set up a 1x2 grid for side-by-side plots
# Original variable
hist(Final_df$Revenue.in.2022,
main = "Histogram of Original",
xlab = "Revenue.in.2022",
col = "lightblue",
border = "black")
# Transformed variable
hist(log(Final_df$Revenue.in.2022),
main = "Histogram of Log-Transformed",
xlab = "log(Revenue.in.2022)",
col = "lightgreen",
border = "black")
# Original variable
hist(Final_df$Employees,
main = "Histogram of Original",
xlab = "Employees",
col = "lightblue",
border = "black")
# Transformed variable
hist(log(Final_df$Employees),
main = "Histogram of Log-Transformed",
xlab = "log(Employees)",
col = "lightgreen",
border = "black")
# Correlation Matrix Calculation
cor_matrix <- cor(Final_df)
print(cor_matrix)
## Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## Revenue.in.2022 1.0000000 0.4200475 0.1490019 0.3410082
## Employees 0.4200475 1.0000000 0.2009087 0.7411684
## Last.Revenue.Growth 0.1490019 0.2009087 1.0000000 0.1095717
## Employees_log 0.3410082 0.7411684 0.1095717 1.0000000
## Revenue.in.2022_log 0.7358805 0.5179241 0.1267887 0.5262732
## Revenue.in.2022_log
## Revenue.in.2022 0.7358805
## Employees 0.5179241
## Last.Revenue.Growth 0.1267887
## Employees_log 0.5262732
## Revenue.in.2022_log 1.0000000
#The results shows least impact of log-transform to correlation matrix.The transformation is acceptable.
Next, outlier have been verified using BOX-plot for log-values. By using the range of 3.0 (range: 1.5-3.0 for normal to moderately skeness data), the results shows there are outliers existing in the data frame for column Last.Revenue.Growth. Hence, the outliers are removed and new BOX-plot after outlier removal is plotted.
head (Final_df,5)
## Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## 1 450 164 0.05 5.099866
## 2 418 6 -2.72 1.791759
## 3 404 6 0.03 1.791759
## 4 374 825 0.41 6.715383
## 5 162 652 1966.76 6.480045
## Revenue.in.2022_log
## 1 6.109248
## 2 6.035481
## 3 6.001415
## 4 5.924256
## 5 5.087596
#BOX plot for checking outliers
par(mar = c(5, 8, 4, 2))
Final_df_log1 <- Final_df[, c("Revenue.in.2022_log", "Employees_log", "Last.Revenue.Growth")]
boxplot(Final_df_log1, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot", range = 3.0)
#Remove outliers
outliers_col <- boxplot.stats(Final_df_log1$Last.Revenue.Growth)$out
outliers <- unique(c(outliers_col))
outliers
## [1] 1966.76 213.08 -50.28
Final_df_log <- Final_df_log1[!Final_df_log1$Last.Revenue.Growth %in% outliers, ]
#BOX plot after remove outliers
boxplot(Final_df_log, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot (Outliers Removed)", range = 3.0)
head(Final_df,5)
## Revenue.in.2022 Employees Last.Revenue.Growth Employees_log
## 1 450 164 0.05 5.099866
## 2 418 6 -2.72 1.791759
## 3 404 6 0.03 1.791759
## 4 374 825 0.41 6.715383
## 5 162 652 1966.76 6.480045
## Revenue.in.2022_log
## 1 6.109248
## 2 6.035481
## 3 6.001415
## 4 5.924256
## 5 5.087596
head(Final_df_log,5)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth
## 1 6.109248 5.099866 0.05
## 2 6.035481 1.791759 -2.72
## 3 6.001415 1.791759 0.03
## 4 5.924256 6.715383 0.41
## 6 5.010635 5.686975 -10.61
After cleaning all the numeric data, the dataset is combine again with the cleaned categorical data.
Final_df_log <- cbind(Final_df_log, Industry.Category = df1$Industry.Category[match(rownames(Final_df_log), rownames(df1))])
head(Final_df_log,5)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## 1 6.109248 5.099866 0.05 F&B
## 2 6.035481 1.791759 -2.72 Electrical components
## 3 6.001415 1.791759 0.03 Electrical components
## 4 5.924256 6.715383 0.41 Retail Trade
## 6 5.010635 5.686975 -10.61 Other
Final_df_log <- cbind(Final_df_log, Location.Category = df1$Location.Category[match(rownames(Final_df_log), rownames(df1))])
head(Final_df_log,5)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## 1 6.109248 5.099866 0.05 F&B
## 2 6.035481 1.791759 -2.72 Electrical components
## 3 6.001415 1.791759 0.03 Electrical components
## 4 5.924256 6.715383 0.41 Retail Trade
## 6 5.010635 5.686975 -10.61 Other
## Location.Category
## 1 North & South America
## 2 Europe
## 3 Europe
## 4 Asia-Pacific
## 6 Europe
The data frame : Final_df_log, will be used for next process onwards.
# Check column names
colnames(Final_df_log)
## [1] "Revenue.in.2022_log" "Employees_log" "Last.Revenue.Growth"
## [4] "Industry.Category" "Location.Category"
# summarization of all columns
summary(Final_df_log)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## Min. :0.6931 Min. :0.000 Min. :-16.440 Length:595
## 1st Qu.:1.0986 1st Qu.:2.833 1st Qu.: -4.405 Class :character
## Median :1.3863 Median :3.258 Median : 0.780 Mode :character
## Mean :1.7724 Mean :3.448 Mean : 1.181
## 3rd Qu.:2.3979 3rd Qu.:4.052 3rd Qu.: 6.545
## Max. :6.1092 Max. :7.179 Max. : 21.880
## Location.Category
## Length:595
## Class :character
## Mode :character
##
##
##
Industry and Location data are attributed as data with values within categorical variables. Meanwhile, Revenue.in.2022_log, Employees_log, Last.Revenue.Growth are attributed as data with values within numeric variables. Based on this information, it is evident that all the data can be categorized as structured data, as both the categorical and numerical data belong to this classification.
#download necessary libraries
# Ensure only numeric columns
numeric_cols <- Final_df_log[, sapply(Final_df_log, is.numeric)]
#BOX plot
boxplot(numeric_cols, col = c("lightblue", "lightgreen", "lightpink"), horizontal = TRUE, main = "Horizontal Box Plot", range = 3.0)
2. Histogram
par(mfrow = c(1, 3))
# Create histogram for Revenue.in.2022_log
hist(Final_df_log$Revenue.in.2022_log,
main = "Histogram of Revenue.in.2022",
xlab = "Revenue",
col = "skyblue")
# Create histogram for Employees_log
hist(Final_df_log$Employees_log,
main = "Histogram of Employees",
xlab = "Employees",
col = "lightgreen")
# Create histogram for Employees_log with a specific color
hist(Final_df_log$Last.Revenue.Growth,
main = "Histogram of Revenue Growth",
xlab = "Revenue Growth",
col = "pink")
Revenue.in.2022 has less variance compared to Last.Revenue.Growth, however the central tendency Revenue.in.2022 is more skewed compared to Last.Revenue.Growth.
#download necessary libraries
# Create a table for counts of Industry.Category
industry_counts <- table(Final_df_log$Industry.Category)
# Create a table for counts of Location.Category
location_counts <- table(Final_df_log$Location.Category)
# Create data frames for ggplot
industry_data <- data.frame(Var1 = names(industry_counts), Freq = as.vector(industry_counts))
location_data <- data.frame(Var1 = names(location_counts), Freq = as.vector(location_counts))
# Plotting individual bar charts for Industry and Location categories using ggplot2
p1 <- ggplot(industry_data, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
labs(title = "Frequency of Industry Categories", x = "Industry Category", y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2 <- ggplot(location_data, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "lightgreen") +
geom_text(aes(label = Freq), vjust = -0.5, size = 3, color = "black") +
labs(title = "Frequency of Location Categories", x = "Location Category", y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Displaying graphs side by side using grid.arrange()
grid.arrange(p1, p2, ncol = 2)
The dataset consist of multiple industries mainly from Asia-Pacific.
To identify the dependent variable as Revenue.in.2022_log or Last.Revenue.Growth by the end of the process.
To break down the process to Numerical vs. Numerical and Categorical vs. Numerical Correlation sincethe data is structured data.
Scatter Plot
# Create a layout with two plots side by side
par(mfrow = c(1, 2))
# Scatter plot for Revenue in 2022 vs. Employees with trendline
plot(Final_df_log$Revenue.in.2022_log, Final_df_log$Employees_log,
xlab = "Revenue in 2022", ylab = "Employees",
main = "Employees vs. Revenue in 2022")
# Fit a linear regression line and add it to the plot
fit <- lm(Employees_log ~ Revenue.in.2022_log, data = Final_df_log)
abline(fit, col = "red")
# Scatter plot for Employees vs. Last Revenue Growth with trendline
plot(Final_df_log$Last.Revenue.Growth, Final_df_log$Employees_log,
xlab = "Last Revenue Growth", ylab = "Employees",
main = "Employees vs. Last Revenue Growth")
# Fit a linear regression line and add it to the plot
fit_last_revenue <- lm(Employees_log ~ Last.Revenue.Growth, data = Final_df_log)
abline(fit_last_revenue, col = "blue")
correlation matrix
correlation_matrix <- cor(Final_df_log[, c("Revenue.in.2022_log", "Employees_log", "Last.Revenue.Growth")])
# Visualize correlation matrix as a heatmap with values inside
# Convert correlation matrix to long format for plotting
correlation_melted <- melt(correlation_matrix)
# Plot heatmap of the correlation matrix with values inside
ggplot(correlation_melted, aes(Var1, Var2, fill = value, label = round(value, 2))) +
geom_tile(color = "white") +
geom_text(color = "black") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Correlation Heatmap with Values") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_fixed()
There are high correlation between Revenue.in.2022_log vs. Employees_log while the Last.Revenue.Growth vs. Employees_log has low correlation.
Correlation analysis for Categorical vs. Numerical is not straight forward as Numerical vs. Numerical.Firstly breakdown the categories and find correlation of the breakdowns to further understand how Revenues are contributing to different parts of categories.
# Unique industry categories
unique_categories <- unique(Final_df_log$Industry.Category)
# Create an empty list to store correlations for each category
correlation_list <- list()
#loop
for (category in unique_categories) {
# Subset data for each category
subset_data <- subset(Final_df_log, Industry.Category == category)
# Calculate correlation between 'Revenue.in.2022_log' and other variables in the subset
correlation <- cor.test(subset_data$Revenue.in.2022_log, subset_data$Employees_log)
# Store correlation information in the list
correlation_list[[category]] <- correlation
}
# Convert the correlation estimates and categories to a data frame
correlation_df <- data.frame(Category = unique_categories,
Correlation = sapply(correlation_list, function(x) x$estimate))
# Sort the data frame by Correlation from highest to lowest
correlation_df <- correlation_df[order(-correlation_df$Correlation), ]
# Print correlations from highest to lowest
for (i in 1:nrow(correlation_df)) {
cat("Correlation for", correlation_df$Category[i], ":", correlation_df$Correlation[i], "\n")
}
## Correlation for Textiles : 0.9720721
## Correlation for Professional : 0.7741121
## Correlation for Other : 0.7206766
## Correlation for Construction : 0.6734928
## Correlation for Finance & Insurance : 0.6247985
## Correlation for F&B : 0.5824068
## Correlation for IT & Computer Services : 0.5681233
## Correlation for E-commerce : 0.546721
## Correlation for Retail Trade : 0.5183457
## Correlation for Telecommunications : 0.5016474
## Correlation for Administration : 0.371465
## Correlation for Media : 0.1360945
## Correlation for Transportation & Storage : -0.09883522
## Correlation for Electrical components : -0.9999847
# Unique location categories
unique_locations <- unique(Final_df_log$Location.Category)
# Create an empty list to store correlations for each location category
correlation_list_location <- list()
# Loop through each unique location category
for (location in unique_locations) {
# Subset data for each location category
subset_data <- subset(Final_df_log, Location.Category == location)
# Calculate correlation between 'Revenue.in.2022_log' and 'Employees_log' in the subset
correlation <- cor.test(subset_data$Revenue.in.2022_log, subset_data$Employees_log)
# Store correlation information in the list
correlation_list_location[[location]] <- correlation
}
# Convert the correlation estimates and categories to a data frame for locations
correlation_df_location <- data.frame(Location = unique_locations,
Correlation = sapply(correlation_list_location, function(x) x$estimate))
# Sort the data frame by Correlation from highest to lowest for locations
correlation_df_location <- correlation_df_location[order(-correlation_df_location$Correlation), ]
# Print correlations from highest to lowest for locations
for (i in 1:nrow(correlation_df_location)) {
cat("Correlation for", correlation_df_location$Location[i], ":", correlation_df_location$Correlation[i], "\n")
}
## Correlation for Middle East : 0.7177035
## Correlation for North & South America : 0.7168009
## Correlation for Australia & New Zealand : 0.5966943
## Correlation for Asia-Pacific : 0.5756662
## Correlation for Europe : 0.5609685
## Correlation for Africa : 0.5587881
The best way to visualize this data is to use the bar chart since there is a lot of industry breakdowns
# Bar chart for correlation_df (Industry.Category)
plot_correlation_df <- ggplot(correlation_df, aes(x = reorder(Category, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(Correlation, 2)), size = 2, color = "black") +
labs(title = "Revenue vs Industry Correlation",
x = "Industry Category", y = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Bar chart for correlation_df_location (Location.Category)
plot_correlation_df_location <- ggplot(correlation_df_location, aes(x = reorder(Location, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "lightgreen") +
geom_text(aes(label = round(Correlation, 2)), size = 3, color = "black") +
labs(title = "Revenue vs Location Correlation",
x = "Location Category", y = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Arrange plots side by side on the same row
plot_correlation_df + plot_correlation_df_location +
plot_layout(ncol = 2)
The most have positive correlations with the Revenue.in.2022_log.
Next analyze with Last.Revenue.Growth as the dependent variable and explore the correlations between Industry and Locations.
# Unique industry categories
unique_categories <- unique(Final_df_log$Industry.Category)
# Create an empty list to store correlations for each category
correlation_list_growth <- list()
# Loop through each unique category
for (category in unique_categories) {
# Subset data for each category
subset_data <- subset(Final_df_log, Industry.Category == category)
# Calculate correlation between 'Last.Revenue.Growth' and 'Employees_log' in the subset
correlation <- cor.test(subset_data$Last.Revenue.Growth, subset_data$Employees_log)
# Store correlation information in the list
correlation_list_growth[[category]] <- correlation
}
# Convert the correlation estimates and categories to a data frame for Last.Revenue.Growth
correlation_df_growth <- data.frame(Category = unique_categories,
Correlation = sapply(correlation_list_growth, function(x) x$estimate))
# Sort the data frame by Correlation from highest to lowest for Last.Revenue.Growth
correlation_df_growth <- correlation_df_growth[order(-correlation_df_growth$Correlation), ]
# Print correlations from highest to lowest for Last.Revenue.Growth
for (i in 1:nrow(correlation_df_growth)) {
cat("Correlation for", correlation_df_growth$Category[i], ":", correlation_df_growth$Correlation[i], "\n")
}
## Correlation for Electrical components : 0.8818947
## Correlation for Telecommunications : 0.2229753
## Correlation for Construction : 0.1277163
## Correlation for Transportation & Storage : 0.0847529
## Correlation for Administration : 0.07614184
## Correlation for Professional : 0.06467406
## Correlation for IT & Computer Services : -0.00192065
## Correlation for Retail Trade : -0.01463785
## Correlation for Other : -0.1103358
## Correlation for E-commerce : -0.1374407
## Correlation for Textiles : -0.2457637
## Correlation for Media : -0.3179541
## Correlation for F&B : -0.6200289
## Correlation for Finance & Insurance : -0.6678347
# Unique location categories
unique_locations <- unique(Final_df_log$Location.Category)
# Create an empty list to store correlations for each location category
correlation_list_location_growth <- list()
# Loop through each unique location category
for (location in unique_locations) {
# Subset data for each location category
subset_data <- subset(Final_df_log, Location.Category == location)
# Calculate correlation between 'Last.Revenue.Growth' and 'Employees_log' in the subset
correlation <- cor.test(subset_data$Last.Revenue.Growth, subset_data$Employees_log)
# Store correlation information in the list
correlation_list_location_growth[[location]] <- correlation
}
# Convert the correlation estimates and categories to a data frame for locations
correlation_df_location_growth <- data.frame(Location = unique_locations,
Correlation = sapply(correlation_list_location_growth, function(x) x$estimate))
# Sort the data frame by Correlation from highest to lowest for locations
correlation_df_location_growth <- correlation_df_location_growth[order(-correlation_df_location_growth$Correlation), ]
# Print correlations from highest to lowest for locations
for (i in 1:nrow(correlation_df_location_growth)) {
cat("Correlation for", correlation_df_location_growth$Location[i], ":", correlation_df_location_growth$Correlation[i], "\n")
}
## Correlation for Africa : 0.374507
## Correlation for Middle East : 0.1808134
## Correlation for Asia-Pacific : 0.04474346
## Correlation for Europe : -0.164285
## Correlation for Australia & New Zealand : -0.1943485
## Correlation for North & South America : -0.3390054
bar charts.
# Bar chart for correlation_df_growth (Industry.Category)
plot_correlation_df_growth <- ggplot(correlation_df_growth, aes(x = reorder(Category, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "skyblue") +
geom_text(aes(label = round(Correlation, 2)), size = 2, color = "black") +
labs(title = "Last Revenue Growth vs Industry",
x = "Industry Category", y = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Bar chart for correlation_df_location_growth (Location.Category)
plot_correlation_df_location_growth <- ggplot(correlation_df_location_growth, aes(x = reorder(Location, Correlation), y = Correlation)) +
geom_bar(stat = "identity", fill = "lightgreen") +
geom_text(aes(label = round(Correlation, 2)), size = 3, color = "black") +
labs(title = "Last Revenue Growth vs Location",
x = "Location Category", y = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Arrange plots side by side on the same row
plot_correlation_df_growth + plot_correlation_df_location_growth +
plot_layout(ncol = 2)
There are quite a handful of correlation outcomes that has less than 0 correlation.
This section focused on the correlations between dependent variables and independent variables and here are few key takeaways that can consider to structure the models:
Revenue.in.2022_log is suggested to be the main dependent variable as it has high correlation to a lot of independent variables comparatively with Last.Revenue.Growth. However, few techniques need to be applied to cater potential skewness challenges. Regression modeling can be the main model for the project, and using Employees as the independent variable can produce a much more straightforward result.
This section addresses Objective 1: exploring the
relationship between employee count and revenue. The dependent variable
is Revenue.in.2022_log (log-transformed revenue), chosen
because it is a continuous numeric measure of financial performance.
Employees_log is used as the sole independent variable to
keep the relationship interpretable. A linear regression is appropriate
here since both variables are numeric and we are estimating the
magnitude of the relationship.
To split the data into training sets and test sets.
#Load necessary packages
# Split the data into training and testing sets
set.seed(123)
# Create a data partition for training and testing
train_index <- createDataPartition(Final_df_log$Revenue.in.2022_log, p = 0.8, list = FALSE)
# Extract the training and testing data
train_data <- Final_df_log[train_index, ]
test_data <- Final_df_log[-train_index, ]
# Create a simple linear regression model with "Employees_log" as the only independent variable
model <- lm(Revenue.in.2022_log ~ Employees_log, data = train_data)
# Summarize the linear regression model
model_summary <- summary(model)
# Print the model summary
print(model_summary)
##
## Call:
## lm(formula = Revenue.in.2022_log ~ Employees_log, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6931 -0.6510 -0.2719 0.4414 5.1399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06769 0.14392 -0.47 0.638
## Employees_log 0.53763 0.03956 13.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9576 on 477 degrees of freedom
## Multiple R-squared: 0.2791, Adjusted R-squared: 0.2776
## F-statistic: 184.7 on 1 and 477 DF, p-value: < 2.2e-16
# Visualize the relationship between Employees_log and Revenue.in.2022_log
plot(train_data$Employees_log, train_data$Revenue.in.2022_log, main = "Scatterplot of Employees_log vs. Revenue.in.2022_log",
xlab = "Employees_log", ylab = "Revenue.in.2022_log", col = "blue", pch = 16)
# Add the regression line to the scatterplot
abline(model, col = "red")
### Test Result
# Predictions on the test set
predictions <- predict(model, newdata = test_data)
# Model Evaluation on the test set
# Calculate Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)
rmse <- sqrt(mean((predictions - test_data$Revenue.in.2022_log)^2))
mae <- mean(abs(predictions - test_data$Revenue.in.2022_log))
# Print evaluation metrics
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 0.7910697
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 0.6073597
# Additional Model Evaluation Metrics
# Calculate R-squared
rsquared <- summary(model)$r.squared
cat("R-squared:", rsquared, "\n")
## R-squared: 0.2790829
# Calculate Adjusted R-squared
adj_rsquared <- summary(model)$adj.r.squared
cat("Adjusted R-squared:", adj_rsquared, "\n")
## Adjusted R-squared: 0.2775716
# Residual Analysis: Plot residuals vs. fitted values
residuals <- residuals(model)
fitted_values <- fitted(model)
plot(fitted_values, residuals, main = "Residuals vs. Fitted Values",
xlab = "Fitted Values", ylab = "Residuals", col = "purple", pch = 16)
# Perform Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(residuals)
cat("Shapiro-Wilk test p-value:", shapiro_test$p.value, "\n")
## Shapiro-Wilk test p-value: 5.673278e-17
This section addresses Objective 2: predicting
whether a company will achieve positive revenue growth in 2023. Unlike
the regression above, the target here is
Last.Revenue.Growth converted into a binary outcome — 1 if
growth was positive, 0 otherwise. Classification is used instead of
regression because the business question is a yes/no decision, not a
precise growth estimate. The same Employees_log predictor
is used to maintain consistency with the regression analysis.
Ensure the dataset is clean and does not have missing values.
head(Final_df_log, 5)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## 1 6.109248 5.099866 0.05 F&B
## 2 6.035481 1.791759 -2.72 Electrical components
## 3 6.001415 1.791759 0.03 Electrical components
## 4 5.924256 6.715383 0.41 Retail Trade
## 6 5.010635 5.686975 -10.61 Other
## Location.Category
## 1 North & South America
## 2 Europe
## 3 Europe
## 4 Asia-Pacific
## 6 Europe
summary(Final_df_log)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## Min. :0.6931 Min. :0.000 Min. :-16.440 Length:595
## 1st Qu.:1.0986 1st Qu.:2.833 1st Qu.: -4.405 Class :character
## Median :1.3863 Median :3.258 Median : 0.780 Mode :character
## Mean :1.7724 Mean :3.448 Mean : 1.181
## 3rd Qu.:2.3979 3rd Qu.:4.052 3rd Qu.: 6.545
## Max. :6.1092 Max. :7.179 Max. : 21.880
## Location.Category
## Length:595
## Class :character
## Mode :character
##
##
##
# check for duplicate rows
duplicaterow <- sum(duplicated(df))
cat("duplicate data :", duplicaterow)
## duplicate data : 0
#check for missing data
colSums(is.na(Final_df_log))
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## 0 0 0 0
## Location.Category
## 0
This is to set the dependent variable as the binary outcome (positive revenue growth or not), and the independent variables as the numbers of employee.
# Encode categorical variable
Final_df_log$Encoded.Last.Revenue.Growth <- ifelse(Final_df_log$Last.Revenue.Growth > 0, 1, 0)
# Create factor version here — before the split so all datasets inherit it
Final_df_log$Factor.Last.Revenue.Growth <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)
# Print the updated dataset
head(Final_df_log,5)
## Revenue.in.2022_log Employees_log Last.Revenue.Growth Industry.Category
## 1 6.109248 5.099866 0.05 F&B
## 2 6.035481 1.791759 -2.72 Electrical components
## 3 6.001415 1.791759 0.03 Electrical components
## 4 5.924256 6.715383 0.41 Retail Trade
## 6 5.010635 5.686975 -10.61 Other
## Location.Category Encoded.Last.Revenue.Growth Factor.Last.Revenue.Growth
## 1 North & South America 1 1
## 2 Europe 0 0
## 3 Europe 1 1
## 4 Asia-Pacific 1 1
## 6 Europe 0 0
Split the dataset into training and testing sets.
#Count numbers of rows
number_rows <- nrow(Final_df_log)
# Print the number of rows
print(paste("Number of rows:", number_rows))
## [1] "Number of rows: 595"
# Set the seed for reproducibility
set.seed(123)
# Specify the split ratio
split <- 0.80
# Sample for the training set
train_data <- sample(seq_len(nrow(Final_df_log)), split * nrow(Final_df_log))
# Create the training set
training_set <- Final_df_log[train_data, ]
# Create the testing set by excluding the training set indices
testing_set <- Final_df_log[-train_data, ]
# Print the training and testing sets
print(paste("Training set size:", nrow(training_set)))
## [1] "Training set size: 476"
print(paste("Testing set size:", nrow(testing_set)))
## [1] "Testing set size: 119"
print(paste("Training set size:", nrow(training_set)))
## [1] "Training set size: 476"
print(paste("Testing set size:", nrow(testing_set)))
## [1] "Testing set size: 119"
# Confirm the factor column exists in both sets
cat("Factor column in training set:",
"Factor.Last.Revenue.Growth" %in% colnames(training_set), "\n")
## Factor column in training set: TRUE
cat("Factor column in testing set:",
"Factor.Last.Revenue.Growth" %in% colnames(testing_set), "\n")
## Factor column in testing set: TRUE
# Check class balance
cat("Class distribution in training set:\n")
## Class distribution in training set:
print(table(training_set$Factor.Last.Revenue.Growth))
##
## 0 1
## 212 264
# Perform logistic regression
logistic_model <- glm(Encoded.Last.Revenue.Growth ~ Employees_log, data = training_set, family = "binomial")
# Print the summary of the logistic regression model
summary(logistic_model)
##
## Call:
## glm(formula = Encoded.Last.Revenue.Growth ~ Employees_log, family = "binomial",
## data = training_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.62140 0.30536 2.035 0.0419 *
## Employees_log -0.11649 0.08416 -1.384 0.1663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 654.18 on 475 degrees of freedom
## Residual deviance: 652.26 on 474 degrees of freedom
## AIC: 656.26
##
## Number of Fisher Scoring iterations: 4
Evaluate the model’s performance on the testing dataset using appropriate metrics such as accuracy, precision, recall, and F1 score.
# Predict on the testing set
predicted_probs <- predict(logistic_model, newdata = testing_set, type = "response")
predicted_Revenue.growth <- ifelse(predicted_probs > 0.5, 1, 0)
# Evaluate the model
conf_matrix <- table(
Predicted = predicted_Revenue.growth,
Actual = testing_set$Encoded.Last.Revenue.Growth
)
print(conf_matrix)
## Actual
## Predicted 0 1
## 0 2 5
## 1 54 58
Results:
1.(True Positives) The model correctly predicted positive revenue growth in 58 instances .
2.(False Positive) The model incorrectly predicted positive revenue growth in 5 instance where there was no actual positive revenue growth .
3.(False Negative) The model failed to predict positive revenue growth in 54 instances where there was actual positive revenue growth .
4.(True Negative) The model correctly predicted no positive revenue growth in 2 instance .
# Confusion matrix values
TP <- 58
FP <- 5
FN <- 54
TN <- 2
# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.5042"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 0.9206"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 0.5179"
print(paste("Specificity:", round(specificity, 4)))
## [1] "Specificity: 0.2857"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.6629"
cv_control <- trainControl(method = "cv", number = 5, classProbs = FALSE)
Final_df_log$CV.Target <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)
cv_model <- train(
CV.Target ~ Employees_log,
data = Final_df_log,
method = "glm",
family = "binomial",
trControl = cv_control
)
print(cv_model)
## Generalized Linear Model
##
## 595 samples
## 1 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 476, 475, 476, 476, 477
## Resampling results:
##
## Accuracy Kappa
## 0.5428586 -0.009902522
cat("Cross-validated Accuracy:", round(max(cv_model$results$Accuracy), 4), "\n")
## Cross-validated Accuracy: 0.5429
# Train the Random Forest model
rf_model <- randomForest::randomForest(Factor.Last.Revenue.Growth ~ Employees_log, data = training_set)
# Make predictions on the testing set
predicted_Revenue.growth <- predict(rf_model, newdata = testing_set, type = "response")
# Evaluate the model (e.g., confusion matrix, accuracy, etc.)
conf_matrix <- table(predicted_Revenue.growth, testing_set$Factor.Last.Revenue.Growth)
# Create a new confusion matrix with updated labels
rf_conf_matrix <- conf_matrix
# Rename rows (predicted labels)
rownames(rf_conf_matrix) <- c("Actual 0", "Actual 1")
# Rename columns (true labels)
colnames(rf_conf_matrix) <- c("Predicted 0", "Predicted 1")
# Print the updated confusion matrix
print(rf_conf_matrix)
##
## predicted_Revenue.growth Predicted 0 Predicted 1
## Actual 0 27 27
## Actual 1 29 36
cv_control <- trainControl(method = "cv", number = 5, classProbs = FALSE)
Final_df_log$CV.Target <- as.factor(Final_df_log$Encoded.Last.Revenue.Growth)
cv_model <- train(
CV.Target ~ Employees_log,
data = Final_df_log,
method = "rf",
family = "binomial",
trControl = cv_control
)
print(cv_model)
## Random Forest
##
## 595 samples
## 1 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 477, 476, 475, 477, 475
## Resampling results:
##
## Accuracy Kappa
## 0.5345544 0.05979448
##
## Tuning parameter 'mtry' was held constant at a value of 2
cat("Cross-validated Accuracy:", round(max(cv_model$results$Accuracy), 4), "\n")
## Cross-validated Accuracy: 0.5346
Results:
1.(True Positives) The model correctly predicted positive revenue growth in 37 instances .
2.(False Positive) The model incorrectly predicted positive revenue growth in 26 instance where there was no actual positive revenue growth .
3.(False Negative) The model failed to predict positive revenue growth in 30 instances where there was actual positive revenue growth .
4.(True Negative) The model correctly predicted no positive revenue growth in 26 instance .
# Confusion matrix values
TP <- 36
FP <- 27
FN <- 29
TN <- 27
# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.5294"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 0.5714"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 0.5538"
print(paste("Specificity:", round(specificity, 4)))
## [1] "Specificity: 0.5"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.5625"
# Train a Naive Bayes model
naive_bayes_model <- naiveBayes(Encoded.Last.Revenue.Growth ~ Employees_log, data = training_set)
# Make predictions on the testing set
predicted_probs <- predict(naive_bayes_model, testing_set, type = "raw")
# Convert predicted probabilities to labels
predicted_Revenue.growth <- predict(naive_bayes_model, testing_set, type = "class")
# Create a confusion matrix
conf_matrix <- table(predicted_Revenue.growth,testing_set$Encoded.Last.Revenue.Growth)
# Create a new confusion matrix with updated labels
nb_conf_matrix <- conf_matrix
# Rename rows (predicted labels)
rownames(nb_conf_matrix) <- c("Actual 0", "Actual 1")
# Rename columns (true labels)
colnames(nb_conf_matrix) <- c("Predicted 0", "Predicted 1")
# Print the updated confusion matrix
print(nb_conf_matrix)
##
## predicted_Revenue.growth Predicted 0 Predicted 1
## Actual 0 0 0
## Actual 1 56 63
Results:
1.(True Positives) The model correctly predicted positive revenue growth in 56 instances .
2.(False Positive) The model incorrectly predicted positive revenue growth in 0 instance where there was no actual positive revenue growth .
3.(False Negative) The model failed to predict positive revenue growth in 63 instances where there was actual positive revenue growth .
4.(True Negative) The model correctly predicted no positive revenue growth in 0 instance .
# Confusion matrix values
TP <- 56
FP <- 0
FN <- 63
TN <- 0
# Calculate Accuracy
accuracy <- (TP + TN) / (TP + FP + FN + TN)
# Calculate Precision
precision <- TP / (TP + FP)
# Calculate Recall (Sensitivity or True Positive Rate)
recall <- TP / (TP + FN)
# Calculate Specificity (True Negative Rate)
specificity <- TN / (TN + FP)
# Calculate F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.4706"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 1"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 0.4706"
print(paste("Specificity:", round(specificity, 4)))
## [1] "Specificity: NaN"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.64"
Several classification model is trained (Logistic Regression, Random Forest, Naives Bayes) to predict the likelihood of positive revenue growth in 2023. The accuracy of each model are 0.5042, 0.5294, 0.4706 respectively.
In conclusion,the regression model confirms a statistically significant positive relationship between employee count and revenue (R² = 0.28, RMSE = 0.79). However, R² = 0.28 means employee count explains only 28% of the variation in revenue — the remaining 72% is driven by factors not captured in this model. Employee count alone is therefore a weak practical predictor, and future work should incorporate additional features such as industry, location, and company age.
The 2 objectives are answered even the model result are not producing satisfactory result. High predictive accuracy is not feasible mainly due to the data limitation. Even though both model doesn’t produce satisfactory results but it’s important to conduct a thorough evaluation and analysis to understand the limitations and potential areas for improvement.
Action for further analysis:
Re-evaluate Data Quality: Data cleaning and pre-processing
Feature Engineering: select the correct feature based on the correlation matrix.