Data622_HWk1

Assignment:

Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).

https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)

Review the structure and content of the tables, and think which two machine learning algorithms presented so far could be used to analyze the data, and how can they be applied in the suggested environment of the datasets.

Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that raise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

Develop your exploratory analysis of the data and the essay in the following 2 weeks. You’ll have until March 17 to submit both.

Exploratory Data Analysis (EDA)

The process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it.

Impport Data and Data Structure

We imported the data from local drive. Another option could be to load the date from Github.

## 'data.frame':    1000 obs. of  14 variables:
##  $ Region        : chr  "Middle East and North Africa" "North America" "Middle East and North Africa" "Asia" ...
##  $ Country       : chr  "Libya" "Canada" "Libya" "Japan" ...
##  $ Item.Type     : chr  "Cosmetics" "Vegetables" "Baby Food" "Cereal" ...
##  $ Sales.Channel : chr  "Offline" "Online" "Offline" "Offline" ...
##  $ Order.Priority: chr  "M" "M" "C" "C" ...
##  $ Order.Date    : chr  "10/18/2014" "11/7/2011" "10/31/2016" "4/10/2010" ...
##  $ Order.ID      : int  686800706 185941302 246222341 161442649 645713555 683458888 679414975 208630645 266467225 118598544 ...
##  $ Ship.Date     : chr  "10/31/2014" "12/8/2011" "12/9/2016" "5/12/2010" ...
##  $ Units.Sold    : int  8446 3018 1517 3322 9845 9528 2844 7299 2428 4800 ...
##  $ Unit.Price    : num  437.2 154.06 255.28 205.7 9.33 ...
##  $ Unit.Cost     : num  263.33 90.93 159.42 117.11 6.92 ...
##  $ Total.Revenue : num  3692591 464953 387260 683335 91854 ...
##  $ Total.Cost    : num  2224085 274427 241840 389039 68127 ...
##  $ Total.Profit  : num  1468506 190526 145420 294296 23726 ...

Region	Country	Item.Type	Sales.Channel	Order.Priority	Order.Date	Order.ID	Ship.Date	Units.Sold	Unit.Price	Unit.Cost	Total.Revenue	Total.Cost	Total.Profit
Middle East and North Africa	Libya	Cosmetics	Offline	M	10/18/2014	686800706	10/31/2014	8446	437.20	263.33	3692591.20	2224085.2	1468506.02
North America	Canada	Vegetables	Online	M	11/7/2011	185941302	12/8/2011	3018	154.06	90.93	464953.08	274426.7	190526.34
Middle East and North Africa	Libya	Baby Food	Offline	C	10/31/2016	246222341	12/9/2016	1517	255.28	159.42	387259.76	241840.1	145419.62
Asia	Japan	Cereal	Offline	C	4/10/2010	161442649	5/12/2010	3322	205.70	117.11	683335.40	389039.4	294295.98
Sub-Saharan Africa	Chad	Fruits	Offline	H	8/16/2011	645713555	8/31/2011	9845	9.33	6.92	91853.85	68127.4	23726.45
Europe	Armenia	Cereal	Online	H	11/24/2014	683458888	12/28/2014	9528	205.70	117.11	1959909.60	1115824.1	844085.52
Sub-Saharan Africa	Eritrea	Cereal	Online	H	3/4/2015	679414975	4/17/2015	2844	205.70	117.11	585010.80	333060.8	251949.96
Europe	Montenegro	Clothes	Offline	M	5/17/2012	208630645	6/28/2012	7299	109.28	35.84	797634.72	261596.2	536038.56

The dataset “1000 Sales Records” has 1000 records or observations with 14 features. The datatypes in this dataset are characters and numericals. The characters datatype represent a designation of something. For example, name of a country where the customer is(I think so because it is not common to see vendors display information about the origin of a product they are selling) or it could be the name of the item a customer bought. The numericals datatype represent the finance of the shop/store for the most. This dataset is about recorded sale of a store which operates in hybrid environment selling various items to customers around the world. Based on this information about the structure of the dataset, we can conclude that we have a labeled data. Therefore, we can be confident in using supervised learning on this dataset. As we know, supervise learning is suitable for data that comes with labels(labeled data).

Since there is no a real problem statement on this dataset in order to apply the appropriate machine learning algorithm, we are going to formulate one or couple problems. For most profit businesses, profit is what drives the business. In most cases, businesses like to have a projection of future revenues in order to have a better planning. Looking at the revenue is almost looking at the profit. For this store, the most important factors that define the profit are the unit cost and the item type. Thus, one problem can be predicting the unit cost of each item or the popular item: Unit Prices Model. Another problem can be determine if the next customer purchase will be a popular or less popular item.

Unit Prices Model: This model is more suitable with forecast analysis. Since we have not talked about forecasting, the other option is regression analysis. In order to use the regression analysis, we need to know how the unit price of an item is made. However, we don’t know what determine the value of an item. Whether it is the origin of the item or the market demand or maybe the combination of the two. So, we will predict the profit that each item can generate. In other words, we are looking at this problem with financial prospective such growth of the business.

For this store, we will limit the feature for total profit to the following: unit price, unit sold, unit cost, total revenue, total cost.

Cleaning Data

#install.packages('Amelia')
library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

sum(is.na(data1000R))

## [1] 0

missmap(data1000R,col=c('yellow','black'),y.at=1,y.labels=' ',legend=TRUE)

#count((data1000R$Order.Priority))

#sum(is.na(data1000R$Order.Priority))
# Not sure why the code below does not work
# data1000R %>% 
#   group_by(data1000R$Order.Priority) %>%
#   summarize(Count=n()) %>%
#   mutate(Percent = (Count/sum(Count))*100) %>%
#   arrange(desc(Count))

We clearly see that there is no missing data.

Processing Data

Let’s remove the variables that we don’t need for this regression analysis. Then, we will reformat the dataset into a new data frame in which item type is grouped but this will define profit based on each item the store sells. This approach sounds somehow easy. When we think through on this approach, there is a potential bias that can be introduced into the new data frame and that is what happens if the unit price of a same item type differs from one region to another or from one country to another or from one date to another. All these presumptions appears to be realistic. Therefore, we want to explore the data to see whether such of price differentiation is indeed in the dataset.

data1000R %>%
  filter(Item.Type == "Cereal") %>%
  head(10)

##                               Region     Country Item.Type Sales.Channel
## 1                               Asia       Japan    Cereal       Offline
## 2                             Europe     Armenia    Cereal        Online
## 3                 Sub-Saharan Africa     Eritrea    Cereal        Online
## 4  Central America and the Caribbean     Grenada    Cereal        Online
## 5                 Sub-Saharan Africa        Mali    Cereal        Online
## 6       Middle East and North Africa    Tunisia     Cereal        Online
## 7                 Sub-Saharan Africa     Liberia    Cereal        Online
## 8       Middle East and North Africa       Libya    Cereal       Offline
## 9       Middle East and North Africa Afghanistan    Cereal        Online
## 10                Sub-Saharan Africa     Eritrea    Cereal       Offline
##    Order.Priority Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price
## 1               C  4/10/2010 161442649  5/12/2010       3322      205.7
## 2               H 11/24/2014 683458888 12/28/2014       9528      205.7
## 3               H   3/4/2015 679414975  4/17/2015       2844      205.7
## 4               H 10/28/2012 430390107 11/13/2012        852      205.7
## 5               L  3/17/2012 235702931   4/3/2012       8590      205.7
## 6               L   4/9/2010 221975171  5/17/2010       6241      205.7
## 7               L   2/8/2015 977313554  3/29/2015       7653      205.7
## 8               M  3/27/2014 964214932  3/31/2014       1480      205.7
## 9               M 10/13/2016 410067975 11/20/2016       7081      205.7
## 10              C   9/3/2014 775076282  9/19/2014       1150      205.7
##    Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     117.11      683335.4  389039.42    294295.98
## 2     117.11     1959909.6 1115824.08    844085.52
## 3     117.11      585010.8  333060.84    251949.96
## 4     117.11      175256.4   99777.72     75478.68
## 5     117.11     1766963.0 1005974.90    760988.10
## 6     117.11     1283773.7  730883.51    552890.19
## 7     117.11     1574222.1  896242.83    677979.27
## 8     117.11      304436.0  173322.80    131113.20
## 9     117.11     1456561.7  829255.91    627305.79
## 10    117.11      236555.0  134676.50    101878.50

data1000R %>%
  filter(Item.Type == "Beverages") %>%
  head(10)

##                               Region    Country Item.Type Sales.Channel
## 1                      North America  Greenland Beverages        Online
## 2  Central America and the Caribbean    Grenada Beverages        Online
## 3                 Sub-Saharan Africa    Senegal Beverages       Offline
## 4       Middle East and North Africa    Morocco Beverages       Offline
## 5  Central America and the Caribbean    Jamaica Beverages       Offline
## 6                               Asia   Mongolia Beverages        Online
## 7                               Asia       Laos Beverages        Online
## 8                             Europe    Croatia Beverages        Online
## 9                             Europe    Belarus Beverages       Offline
## 10                Sub-Saharan Africa Cape Verde Beverages       Offline
##    Order.Priority Order.Date  Order.ID  Ship.Date Units.Sold Unit.Price
## 1               M  7/27/2012 414244067   8/7/2012       2880      47.45
## 2               M  1/30/2017 397877871  3/20/2017       9759      47.45
## 3               M 10/22/2014 683927953  11/4/2014       8334      47.45
## 4               C   6/1/2017 944415509  6/23/2017       2391      47.45
## 5               L   9/4/2010 262056386 10/24/2010       7163      47.45
## 6               M   8/3/2013 329110324   9/2/2013       9913      47.45
## 7               M  3/22/2013 693473613  4/21/2013       3107      47.45
## 8               C  6/16/2016 681941401  7/28/2016        470      47.45
## 9               L   4/3/2013 248335492   4/4/2013       6846      47.45
## 10              H 10/23/2013 858877503  11/6/2013       9794      47.45
##    Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1      31.79      136656.0   91555.20     45100.80
## 2      31.79      463064.5  310238.61    152825.94
## 3      31.79      395448.3  264937.86    130510.44
## 4      31.79      113452.9   76009.89     37443.06
## 5      31.79      339884.3  227711.77    112172.58
## 6      31.79      470371.8  315134.27    155237.58
## 7      31.79      147427.1   98771.53     48655.62
## 8      31.79       22301.5   14941.30      7360.20
## 9      31.79      324842.7  217634.34    107208.36
## 10     31.79      464725.3  311351.26    153374.04

data1000R %>%
  group_by(Item.Type)

## # A tibble: 1,000 x 14
## # Groups:   Item.Type [12]
##    Region     Country Item.Type Sales.Channel Order.Priority Order.Date Order.ID
##    <chr>      <chr>   <chr>     <chr>         <chr>          <chr>         <int>
##  1 Middle Ea~ Libya   Cosmetics Offline       M              10/18/2014   6.87e8
##  2 North Ame~ Canada  Vegetabl~ Online        M              11/7/2011    1.86e8
##  3 Middle Ea~ Libya   Baby Food Offline       C              10/31/2016   2.46e8
##  4 Asia       Japan   Cereal    Offline       C              4/10/2010    1.61e8
##  5 Sub-Sahar~ Chad    Fruits    Offline       H              8/16/2011    6.46e8
##  6 Europe     Armenia Cereal    Online        H              11/24/2014   6.83e8
##  7 Sub-Sahar~ Eritrea Cereal    Online        H              3/4/2015     6.79e8
##  8 Europe     Monten~ Clothes   Offline       M              5/17/2012    2.09e8
##  9 Central A~ Jamaica Vegetabl~ Online        H              1/29/2015    2.66e8
## 10 Australia~ Fiji    Vegetabl~ Offline       H              12/24/2013   1.19e8
## # ... with 990 more rows, and 7 more variables: Ship.Date <chr>,
## #   Units.Sold <int>, Unit.Price <dbl>, Unit.Cost <dbl>, Total.Revenue <dbl>,
## #   Total.Cost <dbl>, Total.Profit <dbl>

Based on the item type “Cereal”, we observed that the price does not really change regardless of other factors. Meaning the unit price is fixed. We have verified the presumption for one item. How about other items sold by this store? We used groupby() function and since we didn’t get any error, we will assume the presumption is also verified for all items sold by the store. There might be a global view to see all items by individual table. Now we have verified the presumption, we can remove unnecessary variables.

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt

## The following object is masked from 'package:tsibble':
## 
##     key

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

# We want to check which item is popular.
data2 <- data.table( ItemType = data1000R$Item.Type)
data2[,.(count = .N), by = ItemType][, percent := prop.table(count)*100][]

##            ItemType count percent
##  1:       Cosmetics    75     7.5
##  2:      Vegetables    97     9.7
##  3:       Baby Food    87     8.7
##  4:          Cereal    79     7.9
##  5:          Fruits    70     7.0
##  6:         Clothes    78     7.8
##  7:          Snacks    82     8.2
##  8:       Household    77     7.7
##  9: Office Supplies    89     8.9
## 10:       Beverages   101    10.1
## 11:   Personal Care    87     8.7
## 12:            Meat    78     7.8

data1000R1 <- data1000R %>%
  filter(Item.Type == "Beverages") %>%
                select(-c(Region,Order.Priority, Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))

data1000R1 %>%
  head(6)

##   Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1       2880      47.45     31.79      136656.0   91555.20     45100.80
## 2       9759      47.45     31.79      463064.5  310238.61    152825.94
## 3       8334      47.45     31.79      395448.3  264937.86    130510.44
## 4       2391      47.45     31.79      113452.9   76009.89     37443.06
## 5       7163      47.45     31.79      339884.3  227711.77    112172.58
## 6       9913      47.45     31.79      470371.8  315134.27    155237.58

Summary and Correlation

This is a summary and correlation of the popular item known as “Beverage”

  summary(data1000R1)

##    Units.Sold     Unit.Price      Unit.Cost     Total.Revenue   
##  Min.   : 114   Min.   :47.45   Min.   :31.79   Min.   :  5409  
##  1st Qu.:2111   1st Qu.:47.45   1st Qu.:31.79   1st Qu.:100167  
##  Median :4571   Median :47.45   Median :31.79   Median :216894  
##  Mean   :4999   Mean   :47.45   Mean   :31.79   Mean   :237205  
##  3rd Qu.:8282   3rd Qu.:47.45   3rd Qu.:31.79   3rd Qu.:392981  
##  Max.   :9919   Max.   :47.45   Max.   :31.79   Max.   :470657  
##    Total.Cost      Total.Profit   
##  Min.   :  3624   Min.   :  1785  
##  1st Qu.: 67109   1st Qu.: 33058  
##  Median :145312   Median : 71582  
##  Mean   :158920   Mean   : 78285  
##  3rd Qu.:263285   3rd Qu.:129696  
##  Max.   :315325   Max.   :155332

#as.numeric(data1000R1$Units.Sold)
#library(Hmisc)
#data1 <- data.frame(data1000R1)
cor(data1000R1)

## Warning in cor(data1000R1): the standard deviation is zero

##               Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost
## Units.Sold             1         NA        NA             1          1
## Unit.Price            NA          1        NA            NA         NA
## Unit.Cost             NA         NA         1            NA         NA
## Total.Revenue          1         NA        NA             1          1
## Total.Cost             1         NA        NA             1          1
## Total.Profit           1         NA        NA             1          1
##               Total.Profit
## Units.Sold               1
## Unit.Price              NA
## Unit.Cost               NA
## Total.Revenue            1
## Total.Cost               1
## Total.Profit             1

#cor(data1000R1[,unlist(lapply(data1000R1, is.numeric))])
#rcorr(as.matrix(data1000R1), type = "Pearson")

Something is wrong with the correlation. we think the fact that the unit price is fixed might be the cause of such correlation output.

Building Model 1 +Visualization

# # load package
# #install.packages("ggstatsplot")
# library(ggstatsplot)
# 
# # correlogram
# ggstatsplot::ggcorrmat(
#   data = data1000R1,
#   type = "parametric", # parametric for Pearson, nonparametric for Spearman's correlation
#   colors = c("darkred", "white", "steelblue") # change default colors
# )

set.seed(232)

library(caTools)
data1000R1s <- sample.split(data1000R1, SplitRatio = 0.70)
train1 <- subset(data1000R1, data1000R1s == TRUE)
test1 <- subset(data1000R1, data1000R1s == FALSE)

model1 <- lm(Total.Profit~., train1)
summary(model1)

## Warning in summary.lm(model1): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = Total.Profit ~ ., data = train1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.067e-10 -5.138e-12  1.590e-12  9.750e-12  1.243e-10 
## 
## Coefficients: (4 not defined because of singularities)
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)   0.000e+00  1.044e-11 0.000e+00        1    
## Units.Sold    1.566e+01  1.696e-15 9.235e+15   <2e-16 ***
## Unit.Price           NA         NA        NA       NA    
## Unit.Cost            NA         NA        NA       NA    
## Total.Revenue        NA         NA        NA       NA    
## Total.Cost           NA         NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.207e-11 on 65 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.529e+31 on 1 and 65 DF,  p-value: < 2.2e-16

plot (model1, which = 2)

plot (model1, which = 1)

There is something strange on the regression performance. The R-squared value is perfect showing only one variable (Unit.Sold) has influence on the total profit. The multilinear regression model could be just a simple linear regression model. This is a bit hard to admit. We want to try to call another function for partionning the data.

partition <- createDataPartition(data1000R1$Total.Profit, p = 0.70, list = FALSE)
train1s <- data1000R1[partition,]
test1s <- data1000R1[-partition,]
dim(train1s)

## [1] 73  6

dim(test1s)

## [1] 28  6

# Fitting the model
model1s <- lm(Total.Profit~Units.Sold
+ Unit.Price+Unit.Cost+Total.Revenue
+Total.Cost, data = train1s)
summary(model1s)

## Warning in summary.lm(model1s): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Price + Unit.Cost + 
##     Total.Revenue + Total.Cost, data = train1s)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.283e-11 -6.076e-12 -1.812e-12  2.609e-12  1.079e-10 
## 
## Coefficients: (4 not defined because of singularities)
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -2.725e-11  3.371e-12 -8.083e+00 1.17e-11 ***
## Units.Sold     1.566e+01  5.796e-16  2.702e+16  < 2e-16 ***
## Unit.Price            NA         NA         NA       NA    
## Unit.Cost             NA         NA         NA       NA    
## Total.Revenue         NA         NA         NA       NA    
## Total.Cost            NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.557e-11 on 71 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.3e+32 on 1 and 71 DF,  p-value: < 2.2e-16

plot(model1s, which = 2)

Same results!

Model1 Accuracy

pred1 <- predict(model1s, newdata = test1s)

## Warning in predict.lm(model1s, newdata = test1s): prediction from a rank-
## deficient fit may be misleading

check <- data.frame(test1s$Total.Profit, pred1, residuals = test1s$Total.Profit - pred1)
check %>%
  head(8)

##    test1s.Total.Profit     pred1     residuals
## 6            155237.58 155237.58 -5.820766e-11
## 7             48655.62  48655.62  7.275958e-12
## 9            107208.36 107208.36 -1.455192e-11
## 13            14047.02  14047.02  2.182787e-11
## 17            75340.26  75340.26 -1.455192e-11
## 19           139436.64 139436.64 -2.910383e-11
## 31            65552.76  65552.76  0.000000e+00
## 32            58396.14  58396.14  0.000000e+00

MSE <- mean((test1s$Total.Profit - pred1)^2)
print(MSE)

## [1] 5.678004e-22

test1s$Predicted.TotalProfit <- predict(model1s,test1s)

## Warning in predict.lm(model1s, test1s): prediction from a rank-deficient fit may
## be misleading

actual_pred <- data.frame(test1s$Total.Profit, test1s$Predicted.TotalProfit)
names(actual_pred) <- c("Actual.Total.Profit", "Predicted.Total.Profit" )
accuracy1 <- cor(actual_pred)
accuracy1

##                        Actual.Total.Profit Predicted.Total.Profit
## Actual.Total.Profit                      1                      1
## Predicted.Total.Profit                   1                      1

head(actual_pred)

##   Actual.Total.Profit Predicted.Total.Profit
## 1           155237.58              155237.58
## 2            48655.62               48655.62
## 3           107208.36              107208.36
## 4            14047.02               14047.02
## 5            75340.26               75340.26
## 6           139436.64              139436.64

test1s$Predicted.TotalProfit <- predict(model1s,test1s)

## Warning in predict.lm(model1s, test1s): prediction from a rank-deficient fit may
## be misleading

plot1s <-test1s %>% 
  ggplot(aes(Total.Profit,Predicted.TotalProfit)) +
  geom_point(alpha=0.5) + 
  stat_smooth(aes(colour='black')) +
  xlab('Actual value of Total Profit for Beverages') +
  ylab('Predicted value of Beverages')+
  theme_bw()
ggplotly(plot1s)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Unusual results we shall say. The correlation shows that Model accuracy is 100%. At this point, not sure what to think of. Another idea is that the approach was not good or the formulation of the problem statement was not correct or sufficiant.

We built a model for the most popular item sold by the store. What if we want to predict the total profit generated by all the items sold by the store. This is a bit complex but can be solvable. Let’s try to see if we will get the same result with a bigger dataset (one million records).

data1000000R <- read.csv("1000000 Sales Records.csv", stringsAsFactors=FALSE)
# Loading data
# df1 <- read.transactions('https://raw.githubusercontent.com/asmozo24/Data624_Market_Basket_Analysis/main/GroceryDataSet.csv', 
#                             sep = ',', rm.duplicates = TRUE)

View(data1000000R)
#glimpse(basket)

str(data1000000R)

## 'data.frame':    1000000 obs. of  14 variables:
##  $ Region        : chr  "Sub-Saharan Africa" "Middle East and North Africa" "Australia and Oceania" "Sub-Saharan Africa" ...
##  $ Country       : chr  "South Africa" "Morocco" "Papua New Guinea" "Djibouti" ...
##  $ Item.Type     : chr  "Fruits" "Clothes" "Meat" "Clothes" ...
##  $ Sales.Channel : chr  "Offline" "Online" "Offline" "Offline" ...
##  $ Order.Priority: chr  "M" "M" "M" "H" ...
##  $ Order.Date    : chr  "7/27/2012" "9/14/2013" "5/15/2015" "5/17/2017" ...
##  $ Order.ID      : int  443368995 667593514 940995585 880811536 174590194 830192887 425793445 659878194 601245963 739008080 ...
##  $ Ship.Date     : chr  "7/28/2012" "10/19/2013" "6/4/2015" "7/2/2017" ...
##  $ Units.Sold    : int  1593 4611 360 562 3973 1379 597 1476 896 7768 ...
##  $ Unit.Price    : num  9.33 109.28 421.89 109.28 47.45 ...
##  $ Unit.Cost     : num  6.92 35.84 364.69 35.84 31.79 ...
##  $ Total.Revenue : num  14863 503890 151880 61415 188519 ...
##  $ Total.Cost    : num  11024 165258 131288 20142 126302 ...
##  $ Total.Profit  : num  3839 338632 20592 41273 62217 ...

# 
# data1000R %>%
#   head(8)%>%
#   kable()

library('data.table')
# data1 <- data.table( orderPriority = data1000R$Order.Priority)
# data1[,.(count = .N), by = orderPriority][, percent := prop.table(count)*100][]

We will reproduce the same code used on the 1000 records dataset.

sum(is.na(data1000000R))

## [1] 0

data3 <- data.table( ItemType = data1000000R$Item.Type)
data3[,.(count = .N), by = ItemType][, percent := prop.table(count)*100][]

##            ItemType count percent
##  1:          Fruits 83551  8.3551
##  2:         Clothes 83240  8.3240
##  3:            Meat 83198  8.3198
##  4:       Beverages 83326  8.3326
##  5: Office Supplies 83222  8.3222
##  6:       Cosmetics 83431  8.3431
##  7:          Snacks 83448  8.3448
##  8:   Personal Care 83539  8.3539
##  9:       Household 83267  8.3267
## 10:      Vegetables 83170  8.3170
## 11:       Baby Food 83397  8.3397
## 12:          Cereal 83211  8.3211

data1000000R1 <- data1000000R %>%
  filter(Item.Type == "Fruits") %>%
                select(-c(Region,Order.Priority, Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))

data1000000R1 %>%
  head(6)

##   Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1       1593       9.33      6.92      14862.69   11023.56      3839.13
## 2       1379       9.33      6.92      12866.07    9542.68      3323.39
## 3       8034       9.33      6.92      74957.22   55595.28     19361.94
## 4       5735       9.33      6.92      53507.55   39686.20     13821.35
## 5       9599       9.33      6.92      89558.67   66425.08     23133.59
## 6       3581       9.33      6.92      33410.73   24780.52      8630.21

Building Model2

partition2 <- createDataPartition(data1000000R1$Total.Profit, p = 0.70, list = FALSE)
train2s <- data1000000R1[partition,]
test2s <- data1000000R1[-partition,]
dim(train2s)

## [1] 73  6

dim(test2s)

## [1] 83478     6

# Fitting the model
model2s <- lm(Total.Profit~Units.Sold
+ Unit.Price+Unit.Cost+Total.Revenue
+Total.Cost, data = train1s)
summary(model2s)

## Warning in summary.lm(model2s): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Price + Unit.Cost + 
##     Total.Revenue + Total.Cost, data = train1s)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.283e-11 -6.076e-12 -1.812e-12  2.609e-12  1.079e-10 
## 
## Coefficients: (4 not defined because of singularities)
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -2.725e-11  3.371e-12 -8.083e+00 1.17e-11 ***
## Units.Sold     1.566e+01  5.796e-16  2.702e+16  < 2e-16 ***
## Unit.Price            NA         NA         NA       NA    
## Unit.Cost             NA         NA         NA       NA    
## Total.Revenue         NA         NA         NA       NA    
## Total.Cost            NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.557e-11 on 71 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.3e+32 on 1 and 71 DF,  p-value: < 2.2e-16

plot(model2s, which = 2)

Model2 Accuracy

pred2 <- predict(model2s, newdata = test2s)

## Warning in predict.lm(model2s, newdata = test2s): prediction from a rank-
## deficient fit may be misleading

check2 <- data.frame(test2s$Total.Profit, pred2, residuals = test2s$Total.Profit - pred2)
check2 %>%
  head(8)

##    test2s.Total.Profit     pred2  residuals
## 6              8630.21  56078.46  -47448.25
## 7             23827.67 154830.42 -131002.75
## 9              3747.55  24351.30  -20603.75
## 13            11423.40  74228.40  -62805.00
## 17             7389.06  48013.56  -40624.50
## 19            15978.30 103825.80  -87847.50
## 31            13170.65  85581.90  -72411.25
## 32             3284.83  21344.58  -18059.75

MSE <- mean((test2s$Total.Profit - pred2)^2)
print(MSE)

## [1] 5841777225

test2s$Predicted.TotalProfit <- predict(model2s,test2s)

## Warning in predict.lm(model2s, test2s): prediction from a rank-deficient fit may
## be misleading

actual_pred2 <- data.frame(test2s$Total.Profit, test2s$Predicted.TotalProfit)
names(actual_pred2) <- c("Actual.Total.Profit", "Predicted.Total.Profit" )
accuracy2 <- cor(actual_pred2)
accuracy2

##                        Actual.Total.Profit Predicted.Total.Profit
## Actual.Total.Profit                      1                      1
## Predicted.Total.Profit                   1                      1

head(actual_pred2)

##   Actual.Total.Profit Predicted.Total.Profit
## 1             8630.21               56078.46
## 2            23827.67              154830.42
## 3             3747.55               24351.30
## 4            11423.40               74228.40
## 5             7389.06               48013.56
## 6            15978.30              103825.80

test2s$Predicted.TotalProfit <- predict(model2s,test2s)

## Warning in predict.lm(model2s, test2s): prediction from a rank-deficient fit may
## be misleading

plot2s <-test2s %>% 
  ggplot(aes(Total.Profit,Predicted.TotalProfit)) +
  geom_point(alpha=0.5) + 
  stat_smooth(aes(colour='black')) +
  xlab('Actual value of Total Profit for Fruits') +
  ylab('Predicted value of Fruits')+
  theme_bw()
ggplotly(plot2s)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We obtain the Same result with one million records compared to the 1000 records.

Another problem we could articulate on these dataset is: Predict if the next customer purchase is a popular item or not. In order to do this, we would have to generate a new variable called “Popularity” which can take value 1 or 0 or yes/no. This variable would be based on certain criteria that we wish there were given. Rather, we can try to determine the priority of the next customer order. We can use decision tree to predict the priority of the next customer order.

data3 <- data.table( OrderPriority = data1000R$Order.Priority)
data3[,.(count = .N), by = OrderPriority][, percent := prop.table(count)*100][]

##    OrderPriority count percent
## 1:             M   242    24.2
## 2:             C   262    26.2
## 3:             H   228    22.8
## 4:             L   268    26.8

#data1000R2 <- data1000R %>%
#  group_by(Order.Priority)

Looking at the features within the dataset, it is bit hard to say whether variables like Region, Country , Sale.Channel, Item.Type, Order.ID, and Order.Date. We say so because there is no really trend or logic observed for each of this variable having influence over the order priority. We make assumption that the variables we used for the regression analysis also have influence on the order priority.

data1000R$Order.Priority <- as.factor(data1000R$Order.Priority)
#str(data1000R2)

data1000R2 <- data1000R %>%
  select(-c(Region,Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))

data1000R2 %>%
  head(6)

##   Order.Priority Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost
## 1              M       8446     437.20    263.33    3692591.20  2224085.2
## 2              M       3018     154.06     90.93     464953.08   274426.7
## 3              C       1517     255.28    159.42     387259.76   241840.1
## 4              C       3322     205.70    117.11     683335.40   389039.4
## 5              H       9845       9.33      6.92      91853.85    68127.4
## 6              H       9528     205.70    117.11    1959909.60  1115824.1
##   Total.Profit
## 1   1468506.02
## 2    190526.34
## 3    145419.62
## 4    294295.98
## 5     23726.45
## 6    844085.52

Build Model 3 Based Decision Tree

library(party)

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## 
## Attaching package: 'modeltools'

## The following object is masked from 'package:arules':
## 
##     info

## The following object is masked from 'package:plyr':
## 
##     empty

## The following object is masked from 'package:BayesFactor':
## 
##     posterior

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following object is masked from 'package:tsibble':
## 
##     index

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

data4 = sample.split(data1000R2, SplitRatio = 0.70)
train4 <- subset(data1000R2, data4 == TRUE)
test4 <- subset(data1000R2, data4 == FALSE)
model4 <- ctree(Order.Priority ~ ., train4)
plot(model4)

Prediction

pred4 <- predict(model4, test4)
classifier <- table(test4$Order.Priority, pred4)
classifier

##    pred4
##       C   H   L   M
##   C   0   0 119   0
##   H   0   0  95   0
##   L   0   0 114   0
##   M   0   0 101   0

The model4 correctly predicted the next customer order to be only Order Priority “C”.

Model4 Accuracy

accuracy4 <- sum(diag(classifier))/sum(classifier)
accuracy4

## [1] 0.2657343

The model 4 accuracy is about 21.5% which is relative low.

data1000000R$Order.Priority <- as.factor(data1000000R$Order.Priority)
#str(data1000R2)

data1000000R2 <- data1000000R %>%
  select(-c(Region,Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))


data5 = sample.split(data1000000R2, SplitRatio = 0.70)
train5 <- subset(data1000000R2, data5 == TRUE)
test5 <- subset(data1000000R2, data5 == FALSE)
model5 <- ctree(Order.Priority ~ ., train4)
summary(model5)

##     Length      Class       Mode 
##          1 BinaryTree         S4

plot(model5)

pred5 <- predict(model5, test5)
classifier2 <- table(test5$Order.Priority, pred5)
classifier2

##    pred5
##          C      H      L      M
##   C      0      0 107226      0
##   H      0      0 106780      0
##   L      0      0 107478      0
##   M      0      0 107088      0

accuracy5 <- sum(diag(classifier2))/sum(classifier2)
accuracy5

## [1] 0.2507817

We got the same result with one million records compared to 1000 records. In conclusion, We want to say that the increasing the number of records for this dataset did not have an influence on the performance of the two machine learning algorithms (multilinear regression and decision tree).