The objective is to select two datasets (recommended: one small, one large) from the following website.
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the context of the datasets. Write a short essay explaining your selection. Then, select one of the two algorithms and explore how to analyze and predict an outcome based on the available data. Test the code with both selected datasets and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
The purpose of data exploration is to learn and understand the characteristics of the variables. In addition, it helps us identify patterns and potential problems in the data.
#Import required libraries
library(ggplot2)
library(dplyr)
library(tsibble)
library(feasts)
library(rpart)
library(caTools)
#Import data 1 with 5000 records
Data_5000 <- read.csv("5000 Sales Records.csv")
head(Data_5000)
## Region Country Item.Type
## 1 Central America and the Caribbean Antigua and Barbuda Baby Food
## 2 Central America and the Caribbean Panama Snacks
## 3 Europe Czech Republic Beverages
## 4 Asia North Korea Cereal
## 5 Asia Sri Lanka Snacks
## 6 Middle East and North Africa Morocco Personal Care
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Online M 12/20/2013 957081544 1/11/2014 552
## 2 Offline C 7/5/2010 301644504 7/26/2010 2167
## 3 Offline C 9/12/2011 478051030 9/29/2011 4778
## 4 Offline L 5/13/2010 892599952 6/15/2010 9016
## 5 Offline C 7/20/2015 571902596 7/27/2015 7542
## 6 Offline L 11/8/2010 412882792 11/22/2010 48
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 255.28 159.42 140914.56 87999.84 52914.72
## 2 152.58 97.44 330640.86 211152.48 119488.38
## 3 47.45 31.79 226716.10 151892.62 74823.48
## 4 205.70 117.11 1854591.20 1055863.76 798727.44
## 5 152.58 97.44 1150758.36 734892.48 415865.88
## 6 81.73 56.67 3923.04 2720.16 1202.88
We can see that the data contain 14 variables/columns and 5000 observations/records. The target variable is "Total.Profit". The dataset has both numerical and categorical variables.
#Check the Summary
summary(Data_5000)
## Region Country Item.Type Sales.Channel
## Length:5000 Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:5000 Length:5000 Min. :100090873 Length:5000
## Class :character Class :character 1st Qu.:320104217 Class :character
## Mode :character Mode :character Median :552314960 Mode :character
## Mean :548644737
## 3rd Qu.:768770944
## Max. :999879729
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 65
## 1st Qu.:2453 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 257417
## Median :5123 Median :154.06 Median : 97.44 Median : 779409
## Mean :5031 Mean :265.75 Mean :187.49 Mean :1325738
## 3rd Qu.:7576 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:1839975
## Max. :9999 Max. :668.27 Max. :524.96 Max. :6672676
## Total.Cost Total.Profit
## Min. : 48 Min. : 16.9
## 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 468181 Median : 279095.2
## Mean : 933093 Mean : 392644.6
## 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :5248025 Max. :1726007.5
From the summary function, we get useful information such as the minimum, maximum, mean, median, and 1st and 3rd quartiles.
str(Data_5000)
## 'data.frame': 5000 obs. of 14 variables:
## $ Region : chr "Central America and the Caribbean" "Central America and the Caribbean" "Europe" "Asia" ...
## $ Country : chr "Antigua and Barbuda " "Panama" "Czech Republic" "North Korea" ...
## $ Item.Type : chr "Baby Food" "Snacks" "Beverages" "Cereal" ...
## $ Sales.Channel : chr "Online" "Offline" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "C" "C" "L" ...
## $ Order.Date : chr "12/20/2013" "7/5/2010" "9/12/2011" "5/13/2010" ...
## $ Order.ID : int 957081544 301644504 478051030 892599952 571902596 412882792 932776868 919133651 579814469 192993152 ...
## $ Ship.Date : chr "1/11/2014" "7/26/2010" "9/29/2011" "6/15/2010" ...
## $ Units.Sold : int 552 2167 4778 9016 7542 48 8258 927 8841 9817 ...
## $ Unit.Price : num 255.3 152.6 47.5 205.7 152.6 ...
## $ Unit.Cost : num 159.4 97.4 31.8 117.1 97.4 ...
## $ Total.Revenue : num 140915 330641 226716 1854591 1150758 ...
## $ Total.Cost : num 88000 211152 151893 1055864 734892 ...
## $ Total.Profit : num 52915 119488 74823 798727 415866 ...
The str function describes the structure of the data. We will have to convert several of the features (dates and categorical variables) to more appropriate types.
Since we have some ‘chr’ features, let’s check the levels of the categorical variables.
#Order.Priority
#Since the variables are not recorded as factor level, we need to use factor.
levels(factor(Data_5000$Order.Priority))
## [1] "C" "H" "L" "M"
There are four levels in Order.Priority: C, H, L, and M.
#Sales.Channel
levels(factor(Data_5000$Sales.Channel))
## [1] "Offline" "Online"
There are two levels in Sales.Channel: Offline and Online.
#Item.Type
levels(factor(Data_5000$Item.Type))
## [1] "Baby Food" "Beverages" "Cereal" "Clothes"
## [5] "Cosmetics" "Fruits" "Household" "Meat"
## [9] "Office Supplies" "Personal Care" "Snacks" "Vegetables"
There are 12 levels in Item.Type, as shown above. Item.Type could be used to create subsets for identifying outliers in Unit.Price and Unit.Cost, by calculating and comparing the mean and median of each column within each item type, since the item type affects both.
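As a minimal sketch of that idea (not part of the original workflow), we can compare the mean and median of Unit.Price and Unit.Cost within each item type; a large gap between the two statistics would hint at skew or outliers.
#Sketch: compare mean vs. median of price and cost within each item type
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Price.Mean = mean(Unit.Price), Price.Median = median(Unit.Price),
Cost.Mean = mean(Unit.Cost), Cost.Median = median(Unit.Cost))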
#Region
levels(factor(Data_5000$Region))
## [1] "Asia" "Australia and Oceania"
## [3] "Central America and the Caribbean" "Europe"
## [5] "Middle East and North Africa" "North America"
## [7] "Sub-Saharan Africa"
There are seven levels in Region, as shown above.
#Search if there are any NA values
sum(is.na(Data_5000))
## [1] 0
The data does not have any missing values.
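The total above is summed over all columns; as a quick per-column sketch, colSums localizes any missing values:
#Sketch: count NA values per column
colSums(is.na(Data_5000))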
Now, let's explore the variables.
#Average Total profit per order for each region
Data_5000$Region <- as.factor(Data_5000$Region)
Region <- aggregate(Data_5000$Total.Profit, by=list(Group=Data_5000$Region), FUN=mean)
Region
## Group x
## 1 Asia 386978.7
## 2 Australia and Oceania 420515.8
## 3 Central America and the Caribbean 383168.5
## 4 Europe 377200.0
## 5 Middle East and North Africa 379312.0
## 6 North America 389060.4
## 7 Sub-Saharan Africa 413340.3
##Calculating the Total profit for each region
Region <- aggregate(Data_5000$Total.Profit, by=list(Group=Data_5000$Region), FUN=sum)
Region
## Group x
## 1 Asia 278237662
## 2 Australia and Oceania 174934573
## 3 Central America and the Caribbean 204611980
## 4 Europe 501676048
## 5 Middle East and North Africa 231380301
## 6 North America 41240398
## 7 Sub-Saharan Africa 531142268
Sub-Saharan Africa has the highest total profit, followed by Europe, while North America has the least.
Data_5000 %>%
group_by(Region) %>%
summarise(Total.Profit = sum(Total.Profit))%>%
ggplot(aes(x = Region, y = Total.Profit, fill = Region)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
x = "Region",
y = "Total.Profit",
title = paste("Total profit for each Region"))+
coord_flip()
Let's check what makes Sub-Saharan Africa the most profitable region.
##Counting the number of orders for each region
Data_5000 %>% count(Region)
## Region n
## 1 Asia 719
## 2 Australia and Oceania 416
## 3 Central America and the Caribbean 534
## 4 Europe 1330
## 5 Middle East and North Africa 610
## 6 North America 106
## 7 Sub-Saharan Africa 1285
It makes sense that Sub-Saharan Africa has the highest profit, followed by Europe, because those two regions have the highest number of orders.
#Explore the number of sales for each Channel.
Data_5000 %>% count(Sales.Channel)
## Sales.Channel n
## 1 Offline 2504
## 2 Online 2496
The number of online sales is 2496 and the number of offline sales is 2504; the difference is negligible.
#Check the profit for each channel
Data_5000 %>%
group_by(Sales.Channel) %>%
summarise(Total.Profit = mean(Total.Profit))
## # A tibble: 2 x 2
## Sales.Channel Total.Profit
## <chr> <dbl>
## 1 Offline 396809.
## 2 Online 388467.
The average profit for the Offline channel is slightly higher than, but close to, that of the Online channel.
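To check whether this small gap could be due to chance, one option (a sketch, not in the original workflow) is a Welch two-sample t-test:
#Sketch: test whether mean profit differs between the two channels
t.test(Total.Profit ~ Sales.Channel, data = Data_5000)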
#Explore the number of Item.Type.
Data_5000 %>% count(Item.Type)
## Item.Type n
## 1 Baby Food 445
## 2 Beverages 447
## 3 Cereal 385
## 4 Clothes 386
## 5 Cosmetics 424
## 6 Fruits 447
## 7 Household 424
## 8 Meat 399
## 9 Office Supplies 420
## 10 Personal Care 415
## 11 Snacks 398
## 12 Vegetables 410
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Units.Sold = mean(Units.Sold))
## # A tibble: 12 x 2
## Item.Type Units.Sold
## <chr> <dbl>
## 1 Baby Food 5112.
## 2 Beverages 4940.
## 3 Cereal 5064.
## 4 Clothes 5138.
## 5 Cosmetics 5052.
## 6 Fruits 5237.
## 7 Household 4951.
## 8 Meat 4765.
## 9 Office Supplies 5018.
## 10 Personal Care 5144.
## 11 Snacks 4827.
## 12 Vegetables 5098.
On average, Fruits sell the most units per order, followed by Personal Care items.
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Units.Sold = sum(Units.Sold))
## # A tibble: 12 x 2
## Item.Type Units.Sold
## <chr> <int>
## 1 Baby Food 2274921
## 2 Beverages 2208169
## 3 Cereal 1949601
## 4 Clothes 1983445
## 5 Cosmetics 2141909
## 6 Fruits 2341083
## 7 Household 2099238
## 8 Meat 1901197
## 9 Office Supplies 2107628
## 10 Personal Care 2134895
## 11 Snacks 1921075
## 12 Vegetables 2090330
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Total.Profit = sum(Total.Profit))
## # A tibble: 12 x 2
## Item.Type Total.Profit
## <chr> <dbl>
## 1 Baby Food 218073927.
## 2 Beverages 34579927.
## 3 Cereal 172715153.
## 4 Clothes 145664201.
## 5 Cosmetics 372413718.
## 6 Fruits 5642010.
## 7 Household 347906714.
## 8 Meat 108748468.
## 9 Office Supplies 266088035
## 10 Personal Care 53500469.
## 11 Snacks 105928076.
## 12 Vegetables 131962533.
The most profitable items are Cosmetics, followed by Household items.
The least profitable are Fruits, even though they have the most units sold.
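This suggests that unit margin, not sales volume, drives profitability. A short sketch to confirm: compute profit per unit sold for each item type.
#Sketch: profit per unit sold by item type
Data_5000 %>%
group_by(Item.Type) %>%
summarise(Profit.Per.Unit = sum(Total.Profit) / sum(Units.Sold)) %>%
arrange(desc(Profit.Per.Unit))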
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Units.Sold = mean(Units.Sold))
## # A tibble: 4 x 2
## Order.Priority Units.Sold
## <chr> <dbl>
## 1 C 5048.
## 2 H 5011.
## 3 L 4893.
## 4 M 5162.
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Total.Profit = mean(Total.Profit))
## # A tibble: 4 x 2
## Order.Priority Total.Profit
## <chr> <dbl>
## 1 C 397007.
## 2 H 398402.
## 3 L 371385.
## 4 M 402945.
Since the meaning of Order.Priority is unclear and the average total profit is almost the same across its levels, let's explore whether any other factor relates to priority.
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Unit.Price = mean(Unit.Price))
## # A tibble: 4 x 2
## Order.Priority Unit.Price
## <chr> <dbl>
## 1 C 276.
## 2 H 268.
## 3 L 260.
## 4 M 260.
The average unit price is almost the same for each order priority. Class M has the highest average profit and the most units sold, though only slightly ahead of classes C and H; class L has the lowest sales and profit.
# Convert date columns to date datatypes
Data_5000[['Order.Date']] <- as.Date(Data_5000[['Order.Date']], "%m/%d/%Y")
Data_5000[['Ship.Date']] <- as.Date(Data_5000[['Ship.Date']], "%m/%d/%Y")
Data_5000$Diff_Dates <- Data_5000$Ship.Date - Data_5000$Order.Date
Data_5000$Diff_Dates <- as.numeric(Data_5000$Diff_Dates)
Data_5000 %>%
group_by(Order.Priority) %>%
summarise(Diff_Dates = mean(Diff_Dates))
## # A tibble: 4 x 2
## Order.Priority Diff_Dates
## <chr> <dbl>
## 1 C 24.7
## 2 H 25.3
## 3 L 24.9
## 4 M 25.2
The average time between the order date and the ship date is nearly identical across priorities, so shipping time does not explain order priority.
We also notice that Ship.Date depends on Order.Date and the two are positively correlated. Therefore, we will remove the Ship.Date column.
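Now that both columns are Date objects, the claim can be verified numerically (a one-line sketch):
#Sketch: correlation between order date and ship date
cor(as.numeric(Data_5000$Order.Date), as.numeric(Data_5000$Ship.Date))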
#install.packages("GGally")
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggcorr(Data_5000, label = TRUE)
## Warning in ggcorr(Data_5000, label = TRUE): data in column(s) 'Region',
## 'Country', 'Item.Type', 'Sales.Channel', 'Order.Priority', 'Order.Date',
## 'Ship.Date' are not numeric and were ignored
From the correlation plot above, we can conclude that Order.ID and Diff_Dates are not useful for prediction. Ship.Date likewise adds no information beyond Order.Date. Therefore, we will remove Order.ID, Diff_Dates, and Ship.Date.
#Remove the Ship.Date, Order.ID, and Diff_Dates columns
Data_5000 <- Data_5000[ , !(names(Data_5000) %in% c("Ship.Date", "Order.ID", "Diff_Dates"))]
#Sales.Channel: 0 for Offline and 1 for Online
Data_5000$Sales.Channel = factor(Data_5000$Sales.Channel, levels = c('Offline','Online'),
labels = c('0','1'))
#Order.Priority: 0 for C, 1 for H, 2 for L, and 3 for M.
Data_5000$Order.Priority = factor(Data_5000$Order.Priority, levels = c('C','H','L', 'M'),
labels = c('0','1','2', '3'))
#Region: 0 for Asia, 1 for Australia and Oceania, 2 for Central America and the Caribbean, 3 for Europe, 4 for Middle East and North Africa, 5 for North America, and 6 for Sub-Saharan Africa
#Note: the original level vector had a stray leading space in ' North America',
#which would silently recode those rows to NA; it is removed here.
Data_5000$Region = factor(Data_5000$Region,
levels = c('Asia', 'Australia and Oceania', 'Central America and the Caribbean',
'Europe', 'Middle East and North Africa', 'North America', 'Sub-Saharan Africa'),
labels = c('0','1','2','3','4','5','6'))
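Because a typo in a level name (for example, stray whitespace) would silently recode rows to NA, a defensive check after the conversions is worthwhile (a sketch, not in the original workflow):
#Sketch: confirm the recoding introduced no NA values
stopifnot(!anyNA(Data_5000$Region),
!anyNA(Data_5000$Sales.Channel),
!anyNA(Data_5000$Order.Priority))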
We are dealing with business data containing transaction records, including revenues, profits, and other important variables. The profit is clearly a dependent variable that relies on factors such as Units.Sold, Sales.Channel, Total.Revenue, and even Country. Therefore, I will treat Total.Profit as the output variable. Machine learning algorithms can be used to predict the profitability of single or multiple transactions.
Two machine learning algorithms can be applied to these two datasets:
Linear regression
Decision tree regression
Differences between linear regression and decision trees:
When there are a large number of features and relatively few observations (with low noise), linear regression may outperform decision trees/random forests; in general, however, decision trees tend to achieve better average accuracy.
For categorical independent variables, decision trees usually perform better than linear regression, and they handle collinearity better.
I will train both the linear regression (LR) and the decision tree (DT) on the training data and evaluate them on the test data; the RMSE (root mean squared error) will be the criterion for choosing between the algorithms.
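Since the same error metric recurs below, a small helper makes the computation explicit (rmse is my own hypothetical name, not a package function; it is equivalent to the sqrt(mean(...)) expressions used later):
#Sketch: root mean squared error helper
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))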
Data_5k <- Data_5000
#Split the Data to training and testing data to conduct a regression model
set.seed(123)
split=sample.split(Data_5k$Total.Profit, SplitRatio = 0.75)
Train_Data=subset(Data_5k,split == TRUE)
Test_Data=subset(Data_5k,split == FALSE)
#Perform linear regression model
set.seed(123)
LR_model <- lm(Total.Profit ~ Item.Type + Units.Sold + Sales.Channel +
Order.Priority + Country,
data = Train_Data)
#summary(LR_model)
#predicted values for test data
x_test= subset(Test_Data, select = -c(Total.Profit))
test.pred.regression <- predict(LR_model, newdata = x_test)
# Test RMSE (the expression below is a root mean squared error, not an MSE)
print("Linear regression RMSE:")
## [1] "Linear regression RMSE:"
print(sqrt(mean((test.pred.regression - Test_Data$Total.Profit)^2)))
## [1] 155764.3
# DTR (decision tree regression)
set.seed(123)
#Note: rpart.control() called on its own only returns and prints the control
#parameters; to take effect, it must be passed to rpart() via control =.
#Since maxdepth = 30 is already the default, the fit below is unaffected.
rpart.control(maxdepth = 30)
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 30
##
## $xval
## [1] 10
fit <- rpart( Total.Profit ~ Item.Type + Region + Sales.Channel + Order.Priority + Country + Units.Sold, method = "anova", data = Train_Data)
# detailed summary of splits
#summary(fit)
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Total.Profit ~ Item.Type + Region + Sales.Channel +
## Order.Priority + Country + Units.Sold, data = Train_Data,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Item.Type Units.Sold
##
## Root node error: 5.3989e+14/3750 = 1.4397e+11
##
## n= 3750
##
## CP nsplit rel error xerror xstd
## 1 0.342214 0 1.000000 1.000467 0.0289225
## 2 0.283656 1 0.657786 0.662435 0.0175066
## 3 0.076726 2 0.374130 0.373971 0.0083564
## 4 0.053013 4 0.220678 0.230813 0.0067690
## 5 0.035541 5 0.167665 0.169710 0.0044695
## 6 0.027597 6 0.132124 0.134139 0.0037096
## 7 0.014504 7 0.104527 0.106810 0.0027778
## 8 0.013645 8 0.090023 0.096990 0.0025250
## 9 0.010000 9 0.076378 0.081018 0.0018113
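The xerror column is still decreasing at the final row, so pruning at the cp with the lowest cross-validated error would leave this tree unchanged; the standard rpart idiom (a sketch) is still worth noting for less clear-cut tables:
#Sketch: prune at the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned_fit <- prune(fit, cp = best_cp)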
# plot tree
plot(fit, uniform=TRUE,
main="Regression Tree for Total profit ")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create an attractive PostScript plot of the tree
post(fit, title = "Regression Tree for Total profit ")
#predicted values for test data
DTR_pred <- predict(fit, newdata = x_test)
print("DTR MSE :")
## [1] "DTR MSE :"
print(sqrt(mean((DTR_pred - Test_Data$Total.Profit)^2)))
## [1] 107424.9
We select the decision tree because its test RMSE (~107,425) is about 30% lower than that of the linear regression (~155,764).
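To illustrate how the chosen tree would support a business decision, here is a hedged example that scores one hypothetical order (all field values below are made up for illustration and must match levels seen in training):
#Sketch: predict the profit of a single hypothetical order
new_order <- data.frame(Item.Type = "Cosmetics",
Region = factor("2", levels = levels(Train_Data$Region)),
Sales.Channel = factor("1", levels = c("0", "1")),
Order.Priority = factor("3", levels = c("0", "1", "2", "3")),
Country = "Panama",
Units.Sold = 5000L)
predict(fit, newdata = new_order)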
#Import data 2 with 10000 records
Data_10000 <- read.csv("10000 Sales Records.csv")
#Sales.Channel: 0 for Offline and 1 for Online
Data_10000$Sales.Channel = factor(Data_10000$Sales.Channel, levels = c('Offline','Online'),
labels = c('0','1'))
#Order.Priority: 0 for C, 1 for H, 2 for L, and 3 for M.
Data_10000$Order.Priority = factor(Data_10000$Order.Priority, levels = c('C','H','L', 'M'),
labels = c('0','1','2', '3'))
#Region: 0 for Asia, 1 for Australia and Oceania, 2 for Central America and the Caribbean, 3 for Europe, 4 for Middle East and North Africa, 5 for North America, and 6 for Sub-Saharan Africa
#Note: the stray leading space in ' North America' is removed here as well.
Data_10000$Region = factor(Data_10000$Region,
levels = c('Asia', 'Australia and Oceania', 'Central America and the Caribbean',
'Europe', 'Middle East and North Africa', 'North America', 'Sub-Saharan Africa'),
labels = c('0','1','2','3','4','5','6'))
#Remove the Ship.Date and Order.ID columns
Data_10000 <- Data_10000[ , !(names(Data_10000) %in% c("Ship.Date", "Order.ID"))]
#Split the Data to training and testing data to conduct a regression model
set.seed(123)
split=sample.split(Data_10000$Total.Profit, SplitRatio = 0.75)
Train_Data1=subset(Data_10000,split == TRUE)
Test_Data1=subset(Data_10000,split == FALSE)
x_test1= subset(Test_Data1, select = -c(Total.Profit))
# DTR
set.seed(123)
#As before, rpart.control() alone only prints the defaults; maxdepth = 30 is
#already the rpart() default.
rpart.control(maxdepth = 30)
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 30
##
## $xval
## [1] 10
fit2 <- rpart( Total.Profit ~ Item.Type + Region + Sales.Channel + Order.Priority + Country + Units.Sold, method = "anova", data = Train_Data1)
# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit2) # visualize cross-validation results
##
## Regression tree:
## rpart(formula = Total.Profit ~ Item.Type + Region + Sales.Channel +
## Order.Priority + Country + Units.Sold, data = Train_Data1,
## method = "anova")
##
## Variables actually used in tree construction:
## [1] Item.Type Units.Sold
##
## Root node error: 1.0573e+15/7500 = 1.4097e+11
##
## n= 7500
##
## CP nsplit rel error xerror xstd
## 1 0.331633 0 1.000000 1.000198 0.0209569
## 2 0.267095 1 0.668367 0.668699 0.0115365
## 3 0.107173 2 0.401272 0.402242 0.0057911
## 4 0.034189 4 0.186926 0.188674 0.0036675
## 5 0.033383 5 0.152737 0.166476 0.0032248
## 6 0.018609 6 0.119354 0.121886 0.0021057
## 7 0.014744 7 0.100745 0.102781 0.0017518
## 8 0.012856 8 0.086001 0.087431 0.0016839
## 9 0.010000 9 0.073146 0.073617 0.0012078
# plot tree
plot(fit2, uniform=TRUE,
main="Regression Tree for Total profit ")
text(fit2, use.n=TRUE, all=TRUE, cex=.8)
# create an attractive PostScript plot of the tree
post(fit2, title = "Regression Tree for Total profit ")
#predicted values for test data
DTR_pred2 <- predict(fit2, newdata = x_test1)
print("DTR MSE :")
## [1] "DTR MSE :"
print(sqrt(mean((DTR_pred2 - Test_Data1$Total.Profit)^2)))
## [1] 103969.7
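Putting the two test errors side by side (values taken from the runs above):
#Sketch: compare the decision tree's test RMSE on both datasets
data.frame(Dataset = c("5000 records", "10000 records"),
DTR.RMSE = c(107424.9, 103969.7))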
The decision tree trained on the 10,000-record dataset achieves a slightly lower test RMSE (103,970 vs. 107,425), so if a business decision depended on one of the two results, I would trust the model built on the larger dataset: more observations generally reduce the variance of the estimates, while very small samples are the more error-prone extreme. Using too much data mainly raises computational cost and can amplify data-quality problems, but it is rarely as risky as using too little.
This analysis also raises practical questions: Should we increase the production of a specific item in a specific region, or introduce similar items? Should we advertise through online channels, or open physical locations in certain regions based on the item type?
To answer these questions, we would need a much larger dataset (1M+ records), since we would split the data into multiple subsets by region. Answering them can make business planning more effective and hopefully lead to higher profitability.