The objective is to select two datasets (recommended: one small, one large) from the following website.
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the context of the datasets. Write a short essay explaining your selection. Then, select one of the two algorithms and explore how to analyze and predict an outcome based on the available data. Test the code with both selected datasets and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
The purpose of data exploration is to learn and understand the characteristics of the variables. In addition, it helps us identify patterns and potential problems in the data.
#Import required libraries
library(ggplot2)
library(dplyr)
library(tsibble)
library(feasts)
library(rpart)
library(caTools)
#Import data 1 with 5000 records
Data_5000 <- read.csv("5000 Sales Records.csv")
head(Data_5000)
## Region Country Item.Type
## 1 Central America and the Caribbean Antigua and Barbuda Baby Food
## 2 Central America and the Caribbean Panama Snacks
## 3 Europe Czech Republic Beverages
## 4 Asia North Korea Cereal
## 5 Asia Sri Lanka Snacks
## 6 Middle East and North Africa Morocco Personal Care
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Online M 12/20/2013 957081544 1/11/2014 552
## 2 Offline C 7/5/2010 301644504 7/26/2010 2167
## 3 Offline C 9/12/2011 478051030 9/29/2011 4778
## 4 Offline L 5/13/2010 892599952 6/15/2010 9016
## 5 Offline C 7/20/2015 571902596 7/27/2015 7542
## 6 Offline L 11/8/2010 412882792 11/22/2010 48
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 255.28 159.42 140914.56 87999.84 52914.72
## 2 152.58 97.44 330640.86 211152.48 119488.38
## 3 47.45 31.79 226716.10 151892.62 74823.48
## 4 205.70 117.11 1854591.20 1055863.76 798727.44
## 5 152.58 97.44 1150758.36 734892.48 415865.88
## 6 81.73 56.67 3923.04 2720.16 1202.88
We can see that the data contain 14 variables/columns and 5000 observations/records. The target variable is "Total.Profit". The dataset has both numerical and categorical variables.
#Check the Summary
summary(Data_5000)
## Region Country Item.Type Sales.Channel
## Length:5000 Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:5000 Length:5000 Min. :100090873 Length:5000
## Class :character Class :character 1st Qu.:320104217 Class :character
## Mode :character Mode :character Median :552314960 Mode :character
## Mean :548644737
## 3rd Qu.:768770944
## Max. :999879729
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 65
## 1st Qu.:2453 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 257417
## Median :5123 Median :154.06 Median : 97.44 Median : 779409
## Mean :5031 Mean :265.75 Mean :187.49 Mean :1325738
## 3rd Qu.:7576 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:1839975
## Max. :9999 Max. :668.27 Max. :524.96 Max. :6672676
## Total.Cost Total.Profit
## Min. : 48 Min. : 16.9
## 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 468181 Median : 279095.2
## Mean : 933093 Mean : 392644.6
## 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :5248025 Max. :1726007.5
From the summary function, we get useful information such as the minimum, maximum, mean, median, and 1st and 3rd quartiles.
str(Data_5000)
## 'data.frame': 5000 obs. of 14 variables:
## $ Region : chr "Central America and the Caribbean" "Central America and the Caribbean" "Europe" "Asia" ...
## $ Country : chr "Antigua and Barbuda " "Panama" "Czech Republic" "North Korea" ...
## $ Item.Type : chr "Baby Food" "Snacks" "Beverages" "Cereal" ...
## $ Sales.Channel : chr "Online" "Offline" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "C" "C" "L" ...
## $ Order.Date : chr "12/20/2013" "7/5/2010" "9/12/2011" "5/13/2010" ...
## $ Order.ID : int 957081544 301644504 478051030 892599952 571902596 412882792 932776868 919133651 579814469 192993152 ...
## $ Ship.Date : chr "1/11/2014" "7/26/2010" "9/29/2011" "6/15/2010" ...
## $ Units.Sold : int 552 2167 4778 9016 7542 48 8258 927 8841 9817 ...
## $ Unit.Price : num 255.3 152.6 47.5 205.7 152.6 ...
## $ Unit.Cost : num 159.4 97.4 31.8 117.1 97.4 ...
## $ Total.Revenue : num 140915 330641 226716 1854591 1150758 ...
## $ Total.Cost : num 88000 211152 151893 1055864 734892 ...
## $ Total.Profit : num 52915 119488 74823 798727 415866 ...
The str function describes the structure of the data. We will have to convert several of the features (dates and categorical variables) to more appropriate types.
Since we have some ‘chr’ features, let’s check the levels of the categorical variables.
#Order.Priority
#Since the variables are not recorded as factor level, we need to use factor.
levels(factor(Data_5000$Order.Priority))
## [1] "C" "H" "L" "M"
There are four levels in Order.Priority: C, H, L, and M.
#Sales.Channel
levels(factor(Data_5000$Sales.Channel))
## [1] "Offline" "Online"
There are two levels in Sales.Channel: Offline and Online.
#Item.Type
levels(factor(Data_5000$Item.Type))
## [1] "Baby Food" "Beverages" "Cereal" "Clothes"
## [5] "Cosmetics" "Fruits" "Household" "Meat"
## [9] "Office Supplies" "Personal Care" "Snacks" "Vegetables"
There are 12 levels in Item.Type, as shown above. Item.Type could be used to create subsets for identifying outliers in Unit.Price and Unit.Cost, by calculating and comparing the mean and median of each column within each item type, since the item type affects both.
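As a minimal sketch of that idea (not part of the original workflow), we can compare the mean and median of Unit.Price and Unit.Cost within each item type; a large gap between the two statistics would hint at skew or outliers.
#Sketch: compare mean vs. median of price and cost within each item type
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Price.Mean = mean(Unit.Price), Price.Median = median(Unit.Price),
Cost.Mean = mean(Unit.Cost), Cost.Median = median(Unit.Cost))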
#Region
levels(factor(Data_5000$Region))
## [1] "Asia" "Australia and Oceania"
## [3] "Central America and the Caribbean" "Europe"
## [5] "Middle East and North Africa" "North America"
## [7] "Sub-Saharan Africa"
There are seven levels in Region, as shown above.
#Search if there are any NA values
sum(is.na(Data_5000))
## [1] 0
The data does not have any missing values.
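The total above is summed over all columns; as a quick per-column sketch, colSums localizes any missing values:
#Sketch: count NA values per column
colSums(is.na(Data_5000))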
Now, let's explore the variables.
#Average Total profit per order for each region
Data_5000$Region <- as.factor(Data_5000$Region)
Region <- aggregate(Data_5000$Total.Profit, by=list(Group=Data_5000$Region), FUN=mean)
Region
## Group x
## 1 Asia 386978.7
## 2 Australia and Oceania 420515.8
## 3 Central America and the Caribbean 383168.5
## 4 Europe 377200.0
## 5 Middle East and North Africa 379312.0
## 6 North America 389060.4
## 7 Sub-Saharan Africa 413340.3
##Calculating the Total profit for each region
Region <- aggregate(Data_5000$Total.Profit, by=list(Group=Data_5000$Region), FUN=sum)
Region
## Group x
## 1 Asia 278237662
## 2 Australia and Oceania 174934573
## 3 Central America and the Caribbean 204611980
## 4 Europe 501676048
## 5 Middle East and North Africa 231380301
## 6 North America 41240398
## 7 Sub-Saharan Africa 531142268
Sub-Saharan Africa has the highest total profit, followed by Europe, while North America has the least.
Data_5000 %>%
group_by(Region) %>%
summarise(Total.Profit = sum(Total.Profit))%>%
ggplot(aes(x = Region, y = Total.Profit, fill = Region)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
x = "Region",
y = "Total.Profit",
title = paste("Total profit for each Region"))+
coord_flip()
Let's check what makes Sub-Saharan Africa the most profitable region.
##Counting the number of orders for each region
Data_5000 %>% count(Region)
## Region n
## 1 Asia 719
## 2 Australia and Oceania 416
## 3 Central America and the Caribbean 534
## 4 Europe 1330
## 5 Middle East and North Africa 610
## 6 North America 106
## 7 Sub-Saharan Africa 1285
It makes sense that Sub-Saharan Africa has the highest profit, followed by Europe, because those two regions have the highest number of orders.
#Explore the number of sales for each Channel.
Data_5000 %>% count(Sales.Channel)
## Sales.Channel n
## 1 Offline 2504
## 2 Online 2496
The number of online sales is 2496 and the number of offline sales is 2504; the difference is negligible.
#Check the profit for each channel
Data_5000 %>%
group_by(Sales.Channel) %>%
summarise(Total.Profit = mean(Total.Profit))
## # A tibble: 2 x 2
## Sales.Channel Total.Profit
## <chr> <dbl>
## 1 Offline 396809.
## 2 Online 388467.
The average profit for the Offline channel is slightly higher than, but close to, that of the Online channel.
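To check whether this small gap could be due to chance, one option (a sketch, not in the original workflow) is a Welch two-sample t-test:
#Sketch: test whether mean profit differs between the two channels
t.test(Total.Profit ~ Sales.Channel, data = Data_5000)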
#Explore the number of Item.Type.
Data_5000 %>% count(Item.Type)
## Item.Type n
## 1 Baby Food 445
## 2 Beverages 447
## 3 Cereal 385
## 4 Clothes 386
## 5 Cosmetics 424
## 6 Fruits 447
## 7 Household 424
## 8 Meat 399
## 9 Office Supplies 420
## 10 Personal Care 415
## 11 Snacks 398
## 12 Vegetables 410
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Units.Sold = mean(Units.Sold))
## # A tibble: 12 x 2
## Item.Type Units.Sold
## <chr> <dbl>
## 1 Baby Food 5112.
## 2 Beverages 4940.
## 3 Cereal 5064.
## 4 Clothes 5138.
## 5 Cosmetics 5052.
## 6 Fruits 5237.
## 7 Household 4951.
## 8 Meat 4765.
## 9 Office Supplies 5018.
## 10 Personal Care 5144.
## 11 Snacks 4827.
## 12 Vegetables 5098.
On average, Fruits sell the most units per order, followed by Personal Care items.
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Units.Sold = sum(Units.Sold))
## # A tibble: 12 x 2
## Item.Type Units.Sold
## <chr> <int>
## 1 Baby Food 2274921
## 2 Beverages 2208169
## 3 Cereal 1949601
## 4 Clothes 1983445
## 5 Cosmetics 2141909
## 6 Fruits 2341083
## 7 Household 2099238
## 8 Meat 1901197
## 9 Office Supplies 2107628
## 10 Personal Care 2134895
## 11 Snacks 1921075
## 12 Vegetables 2090330
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Total.Profit = sum(Total.Profit))
## # A tibble: 12 x 2
## Item.Type Total.Profit
## <chr> <dbl>
## 1 Baby Food 218073927.
## 2 Beverages 34579927.
## 3 Cereal 172715153.
## 4 Clothes 145664201.
## 5 Cosmetics 372413718.
## 6 Fruits 5642010.
## 7 Household 347906714.
## 8 Meat 108748468.
## 9 Office Supplies 266088035
## 10 Personal Care 53500469.
## 11 Snacks 105928076.
## 12 Vegetables 131962533.
The most profitable items are Cosmetics, followed by Household items.
The least profitable are Fruits, even though they have the most units sold.
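This suggests that unit margin, not sales volume, drives profitability. A short sketch to confirm: compute profit per unit sold for each item type.
#Sketch: profit per unit sold by item type
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Profit.Per.Unit = sum(Total.Profit) / sum(Units.Sold)) %>%
arrange(desc(Profit.Per.Unit))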
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Units.Sold = mean(Units.Sold))
## # A tibble: 4 x 2
## Order.Priority Units.Sold
## <chr> <dbl>
## 1 C 5048.
## 2 H 5011.
## 3 L 4893.
## 4 M 5162.
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Total.Profit = mean(Total.Profit))
## # A tibble: 4 x 2
## Order.Priority Total.Profit
## <chr> <dbl>
## 1 C 397007.
## 2 H 398402.
## 3 L 371385.
## 4 M 402945.
Since the meaning of Order.Priority is unclear and the average total profit is almost the same across its levels, let's explore whether any other factor relates to priority.
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Unit.Price = mean(Unit.Price))
## # A tibble: 4 x 2
## Order.Priority Unit.Price
## <chr> <dbl>
## 1 C 276.
## 2 H 268.
## 3 L 260.
## 4 M 260.
The average unit price is almost the same for each order priority. Class M has the highest average profit and the most units sold, though only slightly ahead of classes C and H; class L has the lowest sales and profit.
# Convert date columns to date datatypes
Data_5000[['Order.Date']] <- as.Date(Data_5000[['Order.Date']], "%m/%d/%Y")
Data_5000[['Ship.Date']] <- as.Date(Data_5000[['Ship.Date']], "%m/%d/%Y")
Data_5000$Diff_Dates <- Data_5000$Ship.Date - Data_5000$Order.Date
Data_5000$Diff_Dates <- as.numeric(Data_5000$Diff_Dates)
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Diff_Dates = mean(Diff_Dates))
## # A tibble: 4 x 2
## Order.Priority Diff_Dates
## <chr> <dbl>
## 1 C 24.7
## 2 H 25.3
## 3 L 24.9
## 4 M 25.2
The average time between the order date and the ship date is nearly identical across priorities, so shipping time does not explain order priority.
We also notice that Ship.Date depends on Order.Date and the two are positively correlated. Therefore, we will remove the Ship.Date column.
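Now that both columns are Date objects, the claim can be verified numerically (a one-line sketch):
#Sketch: correlation between order date and ship date
cor(as.numeric(Data_5000$Order.Date), as.numeric(Data_5000$Ship.Date))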
#install.packages("GGally")
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggcorr(Data_5000, label = TRUE)
## Warning in ggcorr(Data_5000, label = TRUE): data in column(s) 'Region',
## 'Country', 'Item.Type', 'Sales.Channel', 'Order.Priority', 'Order.Date',
## 'Ship.Date' are not numeric and were ignored
From the correlation plot above, we can conclude that Order.ID and Diff_Dates are not useful for prediction. Ship.Date likewise adds no information beyond Order.Date. Therefore, we will remove Order.ID, Diff_Dates, and Ship.Date.
#Remove the Ship.Date, Order.ID, and Diff_Dates columns
Data_5000 <- Data_5000[ , !(names(Data_5000) %in% c("Ship.Date", "Order.ID", "Diff_Dates"))]
#Sales.Channel: 0 for Offline and 1 for Online
Data_5000$Sales.Channel = factor(Data_5000$Sales.Channel, levels = c('Offline','Online'),
labels = c('0','1'))
#Order.Priority: 0 for C, 1 for H, 2 for L, and 3 for M.
Data_5000$Order.Priority = factor(Data_5000$Order.Priority, levels = c('C','H','L', 'M'),
labels = c('0','1','2', '3'))
#Region: 0 for Asia, 1 for Australia and Oceania, 2 for Central America and the Caribbean, 3 for Europe, 4 for Middle East and North Africa, 5 for North America, and 6 for Sub-Saharan Africa
#Note: the original level vector had a stray leading space in ' North America',
#which would silently recode those rows to NA; it is removed here.
Data_5000$Region = factor(Data_5000$Region,
levels = c('Asia', 'Australia and Oceania', 'Central America and the Caribbean',
'Europe', 'Middle East and North Africa', 'North America', 'Sub-Saharan Africa'),
labels = c('0','1','2','3','4','5','6'))
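Because a typo in a level name (for example, stray whitespace) would silently recode rows to NA, a defensive check after the conversions is worthwhile (a sketch, not in the original workflow):
#Sketch: confirm the recoding introduced no NA values
stopifnot(!anyNA(Data_5000$Region),
!anyNA(Data_5000$Sales.Channel),
!anyNA(Data_5000$Order.Priority))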
We are dealing with business data containing transaction records, including revenues, profits, and other important variables. The profit is clearly a dependent variable that relies on factors such as Units.Sold, Sales.Channel, Total.Revenue, and even Country. Therefore, I will treat Total.Profit as the output variable. Machine learning algorithms can be used to predict the profitability of single or multiple transactions.
Two machine learning algorithms can be applied to these two datasets:
Linear regression
Decision tree regression
Differences between linear regression and decision trees:
When there are a large number of features and relatively few observations (with low noise), linear regression may outperform decision trees/random forests; in general, however, decision trees tend to achieve better average accuracy.
For categorical independent variables, decision trees usually perform better than linear regression, and they handle collinearity better.
I will train both the linear regression (LR) and the decision tree (DT) on the training data and evaluate them on the test data; the RMSE (root mean squared error) will be the criterion for choosing between the algorithms.
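Since the same error metric recurs below, a small helper makes the computation explicit (rmse is my own hypothetical name, not a package function; it is equivalent to the sqrt(mean(...)) expressions used later):
#Sketch: root mean squared error helper
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))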
Data_5k <- Data_5000
#Split the Data to training and testing data to conduct a regression model
set.seed(123)
split=sample.split(Data_5k$Total.Profit, SplitRatio = 0.75)
Train_Data=subset(Data_5k,split == TRUE)
Test_Data=subset(Data_5k,split == FALSE)
#Perform linear regression model
set.seed(123)
LR_model <- lm(Total.Profit ~ Item.Type + Units.Sold + Sales.Channel +
Order.Priority + Country,
data = Train_Data)
#summary(LR_model)
#predicted values for test data
x_test= subset(Test_Data, select = -c(Total.Profit))
test.pred.regression <- predict(LR_model, newdata = x_test)
# Test RMSE (the expression below is a root mean squared error, not an MSE)
print("Linear regression RMSE:")
## [1] "Linear regression RMSE:"
print(sqrt(mean((test.pred.regression - Test_Data$Total.Profit)^2)))
## [1] 155764.3
# DTR (decision tree regression)
set.seed(123)
#Note: rpart.control() called on its own only returns and prints the control
#parameters; to take effect, it must be passed to rpart() via control =.
#Since maxdepth = 30 is already the default, the fit below is unaffected.
rpart.control(maxdepth = 30)
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 30
##
## $xval
## [1] 10
fit <- rpart( Total.Profit ~ Item.Type + Region + Sales.Channel + Order.Priority + Country + Units.Sold, method = "anova", data = Train_Data)
# detailed summary of splits
#summary(fit)
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Total.Profit ~ Item.Type + Region + Sales.Channel +
## Order.Priority + Country + Units.Sold, data = Train_Data,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Item.Type Units.Sold
##
## Root node error: 5.3989e+14/3750 = 1.4397e+11
##
## n= 3750
##
## CP nsplit rel error xerror xstd
## 1 0.342214 0 1.000000 1.000467 0.0289225
## 2 0.283656 1 0.657786 0.662435 0.0175066
## 3 0.076726 2 0.374130 0.373971 0.0083564
## 4 0.053013 4 0.220678 0.230813 0.0067690
## 5 0.035541 5 0.167665 0.169710 0.0044695
## 6 0.027597 6 0.132124 0.134139 0.0037096
## 7 0.014504 7 0.104527 0.106810 0.0027778
## 8 0.013645 8 0.090023 0.096990 0.0025250
## 9 0.010000 9 0.076378 0.081018 0.0018113
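The xerror column is still decreasing at the final row, so pruning at the cp with the lowest cross-validated error would leave this tree unchanged; the standard rpart idiom (a sketch) is still worth noting for less clear-cut tables:
#Sketch: prune at the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned_fit <- prune(fit, cp = best_cp)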
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Total profit ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create an attractive PostScript plot of the tree
post(fit, title = "Regression Tree for Total profit ")
#predicted values for test data
DTR_pred <- predict(fit, newdata = x_test)
print("DTR MSE :")
## [1] "DTR MSE :"
print(sqrt(mean((DTR_pred - Test_Data$Total.Profit)^2)))
## [1] 107424.9
We select the decision tree because its test RMSE (~107,425) is about 30% lower than that of the linear regression (~155,764).
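To illustrate how the chosen tree would support a business decision, here is a hedged example that scores one hypothetical order (all field values below are made up for illustration and must match levels seen in training):
#Sketch: predict the profit of a single hypothetical order
new_order <- data.frame(Item.Type = "Cosmetics",
Region = factor("2", levels = levels(Train_Data$Region)),
Sales.Channel = factor("1", levels = c("0", "1")),
Order.Priority = factor("3", levels = c("0", "1", "2", "3")),
Country = "Panama",
Units.Sold = 5000L)
predict(fit, newdata = new_order)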
#Import data 2 with 10000 records
Data_10000 <- read.csv("10000 Sales Records.csv")
#Sales.Channel: 0 for Offline and 1 for Online
Data_10000$Sales.Channel = factor(Data_10000$Sales.Channel, levels = c('Offline','Online'),
labels = c('0','1'))
#Order.Priority: 0 for C, 1 for H, 2 for L, and 3 for M.
Data_10000$Order.Priority = factor(Data_10000$Order.Priority, levels = c('C','H','L', 'M'),
labels = c('0','1','2', '3'))
#Region: 0 for Asia, 1 for Australia and Oceania, 2 for Central America and the Caribbean, 3 for Europe, 4 for Middle East and North Africa, 5 for North America, and 6 for Sub-Saharan Africa
#Note: the stray leading space in ' North America' is removed here as well.
Data_10000$Region = factor(Data_10000$Region,
levels = c('Asia', 'Australia and Oceania', 'Central America and the Caribbean',
'Europe', 'Middle East and North Africa', 'North America', 'Sub-Saharan Africa'),
labels = c('0','1','2','3','4','5','6'))
#Remove the Ship.Date and Order.ID columns
Data_10000 <- Data_10000[ , !(names(Data_10000) %in% c("Ship.Date", "Order.ID"))]
#Split the Data to training and testing data to conduct a regression model
set.seed(123)
split=sample.split(Data_10000$Total.Profit, SplitRatio = 0.75)
Train_Data1=subset(Data_10000,split == TRUE)
Test_Data1=subset(Data_10000,split == FALSE)
x_test1= subset(Test_Data1, select = -c(Total.Profit))
# DTR
set.seed(123)
#As before, rpart.control() alone only prints the defaults; maxdepth = 30 is
#already the rpart() default.
rpart.control(maxdepth = 30)
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 30
##
## $xval
## [1] 10
fit2 <- rpart( Total.Profit ~ Item.Type + Region + Sales.Channel + Order.Priority + Country + Units.Sold, method = "anova", data = Train_Data1)
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit2) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Total.Profit ~ Item.Type + Region + Sales.Channel +
## Order.Priority + Country + Units.Sold, data = Train_Data1,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Item.Type Units.Sold
##
## Root node error: 1.0573e+15/7500 = 1.4097e+11
##
## n= 7500
##
## CP nsplit rel error xerror xstd
## 1 0.331633 0 1.000000 1.000198 0.0209569
## 2 0.267095 1 0.668367 0.668699 0.0115365
## 3 0.107173 2 0.401272 0.402242 0.0057911
## 4 0.034189 4 0.186926 0.188674 0.0036675
## 5 0.033383 5 0.152737 0.166476 0.0032248
## 6 0.018609 6 0.119354 0.121886 0.0021057
## 7 0.014744 7 0.100745 0.102781 0.0017518
## 8 0.012856 8 0.086001 0.087431 0.0016839
## 9 0.010000 9 0.073146 0.073617 0.0012078
# plot tree
plot(fit2, uniform=TRUE,
main="Regression Tree for Total profit ")
text(fit2, use.n=TRUE, all=TRUE, cex=.8)
# create an attractive PostScript plot of the tree
post(fit2, title = "Regression Tree for Total profit ")
#predicted values for test data
DTR_pred2 <- predict(fit2, newdata = x_test1)
print("DTR MSE :")
## [1] "DTR MSE :"
print(sqrt(mean((DTR_pred2 - Test_Data1$Total.Profit)^2)))
## [1] 103969.7
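Putting the two test errors side by side (values taken from the runs above):
#Sketch: compare the decision tree's test RMSE on both datasets
data.frame(Dataset = c("5000 records", "10000 records"),
DTR.RMSE = c(107424.9, 103969.7))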
The decision tree trained on the 10,000-record dataset achieves a slightly lower test RMSE (103,970 vs. 107,425), so if a business decision depended on one of the two results, I would trust the model built on the larger dataset: more observations generally reduce the variance of the estimates, while very small samples are the more error-prone extreme. Using too much data mainly raises computational cost and can amplify data-quality problems, but it is rarely as risky as using too little.
This analysis also raises practical questions: Should we increase the production of a specific item in a specific region, or introduce similar items? Should we advertise through online channels, or open physical locations in certain regions based on the item type?
To answer these questions, we would need a much larger dataset (1M+ records), since we would split the data into multiple subsets by region. Answering them can make business planning more effective and hopefully lead to higher profitability.