Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (one small and one large are recommended).
Review the structure and content of the tables, and consider which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets.
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result would you trust if you needed to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
Develop your exploratory analysis of the data and the essay in the following 2 weeks. You’ll have until March 17 to submit both.
Exploratory Data Analysis (EDA)
The process of analyzing and visualizing data to better understand it and glean insights from it.
We imported the data from a local drive. Another option could be to load the data from GitHub.
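A minimal sketch of the import step is shown below; the local file name matches the downloaded file, while the GitHub raw URL is only a hypothetical placeholder, not a location actually used in this report.
# Minimal import sketch; the URL below is a hypothetical placeholder.
data1000R <- read.csv("1000 Sales Records.csv", stringsAsFactors = FALSE)
# data1000R <- read.csv("https://raw.githubusercontent.com/<user>/<repo>/main/1000%20Sales%20Records.csv")
str(data1000R)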
## 'data.frame': 1000 obs. of 14 variables:
## $ Region : chr "Middle East and North Africa" "North America" "Middle East and North Africa" "Asia" ...
## $ Country : chr "Libya" "Canada" "Libya" "Japan" ...
## $ Item.Type : chr "Cosmetics" "Vegetables" "Baby Food" "Cereal" ...
## $ Sales.Channel : chr "Offline" "Online" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "M" "C" "C" ...
## $ Order.Date : chr "10/18/2014" "11/7/2011" "10/31/2016" "4/10/2010" ...
## $ Order.ID : int 686800706 185941302 246222341 161442649 645713555 683458888 679414975 208630645 266467225 118598544 ...
## $ Ship.Date : chr "10/31/2014" "12/8/2011" "12/9/2016" "5/12/2010" ...
## $ Units.Sold : int 8446 3018 1517 3322 9845 9528 2844 7299 2428 4800 ...
## $ Unit.Price : num 437.2 154.06 255.28 205.7 9.33 ...
## $ Unit.Cost : num 263.33 90.93 159.42 117.11 6.92 ...
## $ Total.Revenue : num 3692591 464953 387260 683335 91854 ...
## $ Total.Cost : num 2224085 274427 241840 389039 68127 ...
## $ Total.Profit : num 1468506 190526 145420 294296 23726 ...
Region | Country | Item.Type | Sales.Channel | Order.Priority | Order.Date | Order.ID | Ship.Date | Units.Sold | Unit.Price | Unit.Cost | Total.Revenue | Total.Cost | Total.Profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Middle East and North Africa | Libya | Cosmetics | Offline | M | 10/18/2014 | 686800706 | 10/31/2014 | 8446 | 437.20 | 263.33 | 3692591.20 | 2224085.2 | 1468506.02 |
North America | Canada | Vegetables | Online | M | 11/7/2011 | 185941302 | 12/8/2011 | 3018 | 154.06 | 90.93 | 464953.08 | 274426.7 | 190526.34 |
Middle East and North Africa | Libya | Baby Food | Offline | C | 10/31/2016 | 246222341 | 12/9/2016 | 1517 | 255.28 | 159.42 | 387259.76 | 241840.1 | 145419.62 |
Asia | Japan | Cereal | Offline | C | 4/10/2010 | 161442649 | 5/12/2010 | 3322 | 205.70 | 117.11 | 683335.40 | 389039.4 | 294295.98 |
Sub-Saharan Africa | Chad | Fruits | Offline | H | 8/16/2011 | 645713555 | 8/31/2011 | 9845 | 9.33 | 6.92 | 91853.85 | 68127.4 | 23726.45 |
Europe | Armenia | Cereal | Online | H | 11/24/2014 | 683458888 | 12/28/2014 | 9528 | 205.70 | 117.11 | 1959909.60 | 1115824.1 | 844085.52 |
Sub-Saharan Africa | Eritrea | Cereal | Online | H | 3/4/2015 | 679414975 | 4/17/2015 | 2844 | 205.70 | 117.11 | 585010.80 | 333060.8 | 251949.96 |
Europe | Montenegro | Clothes | Offline | M | 5/17/2012 | 208630645 | 6/28/2012 | 7299 | 109.28 | 35.84 | 797634.72 | 261596.2 | 536038.56 |
The dataset “1000 Sales Records” has 1000 records, or observations, with 14 features. The datatypes in this dataset are character and numeric. The character variables designate something: for example, the name of the country where the customer is located (likely so, because it is not common to see vendors display the origin of the products they sell), or the name of the item a customer bought. The numeric variables mostly represent the finances of the shop/store. This dataset records the sales of a store that operates in a hybrid environment (both online and offline channels), selling various items to customers around the world. Based on this structure, we can conclude that we have labeled data, so we can be confident in using supervised learning on this dataset; supervised learning is suited to data that comes with labels.
Since this dataset comes with no real problem statement for choosing an appropriate machine learning algorithm, we will formulate one or two problems ourselves. For most for-profit businesses, profit is what drives the business, and businesses like to have a projection of future revenues for better planning; looking at revenue is almost the same as looking at profit. For this store, the most important factors that define the profit are the unit cost and the item type. Thus, one problem can be predicting the unit cost of each item, or of the most popular item: a Unit Prices Model. Another problem can be determining whether the next customer purchase will be a popular or a less popular item.
Unit Prices Model: this problem is better suited to forecasting. Since we have not talked about forecasting yet, the other option is regression analysis. To use regression analysis we would need to know how the unit price of an item is set, but we do not know what determines the value of an item: its origin, the market demand, or perhaps a combination of the two. So we will instead predict the profit that each item can generate. In other words, we are looking at this problem from a financial perspective, such as the growth of the business.
For this store, we will limit the features for predicting total profit to the following: units sold, unit price, unit cost, total revenue, and total cost.
#install.packages('Amelia')
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
sum(is.na(data1000R))
## [1] 0
missmap(data1000R,col=c('yellow','black'),y.at=1,y.labels=' ',legend=TRUE)
#count(data1000R$Order.Priority)
#sum(is.na(data1000R$Order.Priority))
# The pipeline below originally failed. Two likely causes: group_by() was given
# data1000R$Order.Priority instead of the bare column name, and plyr (also
# attached in this session) masks dplyr's summarize(). The corrected form is
# kept commented, since the equivalent count is produced with data.table later.
# data1000R %>%
#   group_by(Order.Priority) %>%
#   dplyr::summarize(Count = n()) %>%
#   mutate(Percent = (Count / sum(Count)) * 100) %>%
#   arrange(desc(Count))
We clearly see that there is no missing data.
Let’s remove the variables that we don’t need for this regression analysis. Then we will reformat the dataset into a new data frame in which item type is grouped, so that profit is defined for each item the store sells. This approach sounds straightforward enough, but thinking it through, a potential bias could be introduced into the new data frame: what happens if the unit price of the same item type differs from one region to another, from one country to another, or from one date to another? All of these presumptions appear realistic. Therefore, we want to explore the data to see whether such price differentiation is indeed present in the dataset.
data1000R %>%
filter(Item.Type == "Cereal") %>%
head(10)
## Region Country Item.Type Sales.Channel
## 1 Asia Japan Cereal Offline
## 2 Europe Armenia Cereal Online
## 3 Sub-Saharan Africa Eritrea Cereal Online
## 4 Central America and the Caribbean Grenada Cereal Online
## 5 Sub-Saharan Africa Mali Cereal Online
## 6 Middle East and North Africa Tunisia Cereal Online
## 7 Sub-Saharan Africa Liberia Cereal Online
## 8 Middle East and North Africa Libya Cereal Offline
## 9 Middle East and North Africa Afghanistan Cereal Online
## 10 Sub-Saharan Africa Eritrea Cereal Offline
## Order.Priority Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 1 C 4/10/2010 161442649 5/12/2010 3322 205.7
## 2 H 11/24/2014 683458888 12/28/2014 9528 205.7
## 3 H 3/4/2015 679414975 4/17/2015 2844 205.7
## 4 H 10/28/2012 430390107 11/13/2012 852 205.7
## 5 L 3/17/2012 235702931 4/3/2012 8590 205.7
## 6 L 4/9/2010 221975171 5/17/2010 6241 205.7
## 7 L 2/8/2015 977313554 3/29/2015 7653 205.7
## 8 M 3/27/2014 964214932 3/31/2014 1480 205.7
## 9 M 10/13/2016 410067975 11/20/2016 7081 205.7
## 10 C 9/3/2014 775076282 9/19/2014 1150 205.7
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 117.11 683335.4 389039.42 294295.98
## 2 117.11 1959909.6 1115824.08 844085.52
## 3 117.11 585010.8 333060.84 251949.96
## 4 117.11 175256.4 99777.72 75478.68
## 5 117.11 1766963.0 1005974.90 760988.10
## 6 117.11 1283773.7 730883.51 552890.19
## 7 117.11 1574222.1 896242.83 677979.27
## 8 117.11 304436.0 173322.80 131113.20
## 9 117.11 1456561.7 829255.91 627305.79
## 10 117.11 236555.0 134676.50 101878.50
data1000R %>%
filter(Item.Type == "Beverages") %>%
head(10)
## Region Country Item.Type Sales.Channel
## 1 North America Greenland Beverages Online
## 2 Central America and the Caribbean Grenada Beverages Online
## 3 Sub-Saharan Africa Senegal Beverages Offline
## 4 Middle East and North Africa Morocco Beverages Offline
## 5 Central America and the Caribbean Jamaica Beverages Offline
## 6 Asia Mongolia Beverages Online
## 7 Asia Laos Beverages Online
## 8 Europe Croatia Beverages Online
## 9 Europe Belarus Beverages Offline
## 10 Sub-Saharan Africa Cape Verde Beverages Offline
## Order.Priority Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 1 M 7/27/2012 414244067 8/7/2012 2880 47.45
## 2 M 1/30/2017 397877871 3/20/2017 9759 47.45
## 3 M 10/22/2014 683927953 11/4/2014 8334 47.45
## 4 C 6/1/2017 944415509 6/23/2017 2391 47.45
## 5 L 9/4/2010 262056386 10/24/2010 7163 47.45
## 6 M 8/3/2013 329110324 9/2/2013 9913 47.45
## 7 M 3/22/2013 693473613 4/21/2013 3107 47.45
## 8 C 6/16/2016 681941401 7/28/2016 470 47.45
## 9 L 4/3/2013 248335492 4/4/2013 6846 47.45
## 10 H 10/23/2013 858877503 11/6/2013 9794 47.45
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 31.79 136656.0 91555.20 45100.80
## 2 31.79 463064.5 310238.61 152825.94
## 3 31.79 395448.3 264937.86 130510.44
## 4 31.79 113452.9 76009.89 37443.06
## 5 31.79 339884.3 227711.77 112172.58
## 6 31.79 470371.8 315134.27 155237.58
## 7 31.79 147427.1 98771.53 48655.62
## 8 31.79 22301.5 14941.30 7360.20
## 9 31.79 324842.7 217634.34 107208.36
## 10 31.79 464725.3 311351.26 153374.04
data1000R %>%
group_by(Item.Type)
## # A tibble: 1,000 x 14
## # Groups: Item.Type [12]
## Region Country Item.Type Sales.Channel Order.Priority Order.Date Order.ID
## <chr> <chr> <chr> <chr> <chr> <chr> <int>
## 1 Middle Ea~ Libya Cosmetics Offline M 10/18/2014 6.87e8
## 2 North Ame~ Canada Vegetabl~ Online M 11/7/2011 1.86e8
## 3 Middle Ea~ Libya Baby Food Offline C 10/31/2016 2.46e8
## 4 Asia Japan Cereal Offline C 4/10/2010 1.61e8
## 5 Sub-Sahar~ Chad Fruits Offline H 8/16/2011 6.46e8
## 6 Europe Armenia Cereal Online H 11/24/2014 6.83e8
## 7 Sub-Sahar~ Eritrea Cereal Online H 3/4/2015 6.79e8
## 8 Europe Monten~ Clothes Offline M 5/17/2012 2.09e8
## 9 Central A~ Jamaica Vegetabl~ Online H 1/29/2015 2.66e8
## 10 Australia~ Fiji Vegetabl~ Offline H 12/24/2013 1.19e8
## # ... with 990 more rows, and 7 more variables: Ship.Date <chr>,
## # Units.Sold <int>, Unit.Price <dbl>, Unit.Cost <dbl>, Total.Revenue <dbl>,
## # Total.Cost <dbl>, Total.Profit <dbl>
Based on the item type “Cereal”, we observed that the price does not really change regardless of other factors, meaning the unit price is fixed. We have verified the presumption for one item (and spot-checked “Beverages” as well). How about the other items sold by this store? Note that calling group_by() alone only marks the grouping; the absence of an error does not verify the presumption for all items, and eyeballing a separate table per item would be tedious. A firmer check is to count the distinct unit prices within each item type, as sketched below. With the presumption verified, we can remove the unnecessary variables.
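The following sketch counts the distinct prices and costs per item type; a count of 1 in every row would confirm that prices are fixed per item across regions, countries, and dates.
# Sketch: count distinct unit prices and costs within each item type.
# dplyr::summarize is called explicitly because plyr is also attached.
data1000R %>%
  group_by(Item.Type) %>%
  dplyr::summarize(Distinct.Prices = n_distinct(Unit.Price),
                   Distinct.Costs = n_distinct(Unit.Cost))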
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## The following object is masked from 'package:tsibble':
##
## key
## The following objects are masked from 'package:dplyr':
##
## between, first, last
# We want to check which item is popular.
data2 <- data.table( ItemType = data1000R$Item.Type)
data2[,.(count = .N), by = ItemType][, percent := prop.table(count)*100][]
## ItemType count percent
## 1: Cosmetics 75 7.5
## 2: Vegetables 97 9.7
## 3: Baby Food 87 8.7
## 4: Cereal 79 7.9
## 5: Fruits 70 7.0
## 6: Clothes 78 7.8
## 7: Snacks 82 8.2
## 8: Household 77 7.7
## 9: Office Supplies 89 8.9
## 10: Beverages 101 10.1
## 11: Personal Care 87 8.7
## 12: Meat 78 7.8
data1000R1 <- data1000R %>%
filter(Item.Type == "Beverages") %>%
select(-c(Region,Order.Priority, Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))
data1000R1 %>%
head(6)
## Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 2880 47.45 31.79 136656.0 91555.20 45100.80
## 2 9759 47.45 31.79 463064.5 310238.61 152825.94
## 3 8334 47.45 31.79 395448.3 264937.86 130510.44
## 4 2391 47.45 31.79 113452.9 76009.89 37443.06
## 5 7163 47.45 31.79 339884.3 227711.77 112172.58
## 6 9913 47.45 31.79 470371.8 315134.27 155237.58
Here are the summary statistics and the correlation matrix for the most popular item type, “Beverages”.
summary(data1000R1)
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 114 Min. :47.45 Min. :31.79 Min. : 5409
## 1st Qu.:2111 1st Qu.:47.45 1st Qu.:31.79 1st Qu.:100167
## Median :4571 Median :47.45 Median :31.79 Median :216894
## Mean :4999 Mean :47.45 Mean :31.79 Mean :237205
## 3rd Qu.:8282 3rd Qu.:47.45 3rd Qu.:31.79 3rd Qu.:392981
## Max. :9919 Max. :47.45 Max. :31.79 Max. :470657
## Total.Cost Total.Profit
## Min. : 3624 Min. : 1785
## 1st Qu.: 67109 1st Qu.: 33058
## Median :145312 Median : 71582
## Mean :158920 Mean : 78285
## 3rd Qu.:263285 3rd Qu.:129696
## Max. :315325 Max. :155332
#as.numeric(data1000R1$Units.Sold)
#library(Hmisc)
#data1 <- data.frame(data1000R1)
cor(data1000R1)
## Warning in cor(data1000R1): the standard deviation is zero
## Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost
## Units.Sold 1 NA NA 1 1
## Unit.Price NA 1 NA NA NA
## Unit.Cost NA NA 1 NA NA
## Total.Revenue 1 NA NA 1 1
## Total.Cost 1 NA NA 1 1
## Total.Profit 1 NA NA 1 1
## Total.Profit
## Units.Sold 1
## Unit.Price NA
## Unit.Cost NA
## Total.Revenue 1
## Total.Cost 1
## Total.Profit 1
#cor(data1000R1[,unlist(lapply(data1000R1, is.numeric))])
#rcorr(as.matrix(data1000R1), type = "Pearson")
Something is wrong with the correlation. Because the unit price (and unit cost) of Beverages is fixed, those columns have zero standard deviation, so cor() returns NA for them and raises the warning above. The remaining columns (Total.Revenue, Total.Cost, Total.Profit) are exact multiples of Units.Sold, which is why every correlation that can be computed is exactly 1.
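A cleaner matrix can be obtained by dropping the zero-variance columns before correlating; a minimal sketch:
# Sketch: keep only the columns whose standard deviation is nonzero, so cor()
# no longer produces NAs for the constant Unit.Price and Unit.Cost columns.
nonconstant <- sapply(data1000R1, function(x) sd(x) > 0)
cor(data1000R1[, nonconstant])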
# # load package
# #install.packages("ggstatsplot")
# library(ggstatsplot)
#
# # correlogram
# ggstatsplot::ggcorrmat(
# data = data1000R1,
# type = "parametric", # parametric for Pearson, nonparametric for Spearman's correlation
# colors = c("darkred", "white", "steelblue") # change default colors
# )
set.seed(232)
library(caTools)
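# Note: sample.split() expects the outcome vector (e.g. data1000R1$Total.Profit),
# not a whole data frame. Given a data frame, it returns a logical vector of
# length ncol(data1000R1) = 6, which subset() then recycles across the rows.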
data1000R1s <- sample.split(data1000R1, SplitRatio = 0.70)
train1 <- subset(data1000R1, data1000R1s == TRUE)
test1 <- subset(data1000R1, data1000R1s == FALSE)
model1 <- lm(Total.Profit~., train1)
summary(model1)
## Warning in summary.lm(model1): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Total.Profit ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.067e-10 -5.138e-12 1.590e-12 9.750e-12 1.243e-10
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000e+00 1.044e-11 0.000e+00 1
## Units.Sold 1.566e+01 1.696e-15 9.235e+15 <2e-16 ***
## Unit.Price NA NA NA NA
## Unit.Cost NA NA NA NA
## Total.Revenue NA NA NA NA
## Total.Cost NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.207e-11 on 65 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.529e+31 on 1 and 65 DF, p-value: < 2.2e-16
plot(model1, which = 2)
plot(model1, which = 1)
There is something strange about the regression performance. The R-squared value is perfect, and only one variable (Units.Sold) shows any influence on the total profit, so the multiple linear regression collapses into a simple linear regression. In hindsight this makes sense: with fixed prices, Total.Profit = (Unit.Price - Unit.Cost) * Units.Sold exactly, and the fitted slope of 15.66 is indeed 47.45 - 31.79. The other predictors are themselves linear functions of Units.Sold, which is why lm() drops them as singularities. Still, this is a bit hard to accept, so we want to try another function for partitioning the data.
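# createDataPartition() comes from the caret package, which is assumed to be
# attached in an earlier setup chunk of this report.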
partition <- createDataPartition(data1000R1$Total.Profit, p = 0.70, list = FALSE)
train1s <- data1000R1[partition,]
test1s <- data1000R1[-partition,]
dim(train1s)
## [1] 73 6
dim(test1s)
## [1] 28 6
# Fitting the model
model1s <- lm(Total.Profit~Units.Sold
+ Unit.Price+Unit.Cost+Total.Revenue
+Total.Cost, data = train1s)
summary(model1s)
## Warning in summary.lm(model1s): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Price + Unit.Cost +
## Total.Revenue + Total.Cost, data = train1s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.283e-11 -6.076e-12 -1.812e-12 2.609e-12 1.079e-10
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.725e-11 3.371e-12 -8.083e+00 1.17e-11 ***
## Units.Sold 1.566e+01 5.796e-16 2.702e+16 < 2e-16 ***
## Unit.Price NA NA NA NA
## Unit.Cost NA NA NA NA
## Total.Revenue NA NA NA NA
## Total.Cost NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.557e-11 on 71 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.3e+32 on 1 and 71 DF, p-value: < 2.2e-16
plot(model1s, which = 2)
Same results!
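As a quick sanity check (a sketch), the fitted slope should equal the Beverages per-unit margin:
# Sanity-check sketch: the slope on Units.Sold should match the per-unit
# margin, Unit.Price - Unit.Cost = 47.45 - 31.79 = 15.66.
unique(data1000R1$Unit.Price - data1000R1$Unit.Cost)
coef(model1s)["Units.Sold"]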
pred1 <- predict(model1s, newdata = test1s)
## Warning in predict.lm(model1s, newdata = test1s): prediction from a rank-
## deficient fit may be misleading
check <- data.frame(test1s$Total.Profit, pred1, residuals = test1s$Total.Profit - pred1)
check %>%
head(8)
## test1s.Total.Profit pred1 residuals
## 6 155237.58 155237.58 -5.820766e-11
## 7 48655.62 48655.62 7.275958e-12
## 9 107208.36 107208.36 -1.455192e-11
## 13 14047.02 14047.02 2.182787e-11
## 17 75340.26 75340.26 -1.455192e-11
## 19 139436.64 139436.64 -2.910383e-11
## 31 65552.76 65552.76 0.000000e+00
## 32 58396.14 58396.14 0.000000e+00
MSE <- mean((test1s$Total.Profit - pred1)^2)
print(MSE)
## [1] 5.678004e-22
test1s$Predicted.TotalProfit <- predict(model1s,test1s)
## Warning in predict.lm(model1s, test1s): prediction from a rank-deficient fit may
## be misleading
actual_pred <- data.frame(test1s$Total.Profit, test1s$Predicted.TotalProfit)
names(actual_pred) <- c("Actual.Total.Profit", "Predicted.Total.Profit" )
accuracy1 <- cor(actual_pred)
accuracy1
## Actual.Total.Profit Predicted.Total.Profit
## Actual.Total.Profit 1 1
## Predicted.Total.Profit 1 1
head(actual_pred)
## Actual.Total.Profit Predicted.Total.Profit
## 1 155237.58 155237.58
## 2 48655.62 48655.62
## 3 107208.36 107208.36
## 4 14047.02 14047.02
## 5 75340.26 75340.26
## 6 139436.64 139436.64
test1s$Predicted.TotalProfit <- predict(model1s,test1s)
## Warning in predict.lm(model1s, test1s): prediction from a rank-deficient fit may
## be misleading
plot1s <- test1s %>%
  ggplot(aes(Total.Profit, Predicted.TotalProfit)) +
  geom_point(alpha = 0.5) +
  # colour is set outside aes(): inside aes(), 'black' is mapped as a factor
  # level (creating a spurious legend) instead of being applied literally
  stat_smooth(colour = 'black') +
  xlab('Actual value of Total Profit for Beverages') +
  ylab('Predicted value of Total Profit for Beverages') +
  theme_bw()
ggplotly(plot1s)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Unusual results, we should say. The correlation between actual and predicted values is exactly 1, which reads as 100% model accuracy (although correlation is not really an accuracy metric for regression). At this point it is hard to know what to think; perhaps the approach was not good, or the formulation of the problem statement was not correct or sufficient. In fact, since Total.Profit is computed exactly from the other columns, a perfect fit is the expected outcome.
We built a model for the most popular item sold by the store. What if we want to predict the total profit generated by all the items sold by the store? That is more complex, but solvable. Let’s see whether we get the same result with a bigger dataset (one million records).
data1000000R <- read.csv("1000000 Sales Records.csv", stringsAsFactors=FALSE)
View(data1000000R)
str(data1000000R)
## 'data.frame': 1000000 obs. of 14 variables:
## $ Region : chr "Sub-Saharan Africa" "Middle East and North Africa" "Australia and Oceania" "Sub-Saharan Africa" ...
## $ Country : chr "South Africa" "Morocco" "Papua New Guinea" "Djibouti" ...
## $ Item.Type : chr "Fruits" "Clothes" "Meat" "Clothes" ...
## $ Sales.Channel : chr "Offline" "Online" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "M" "M" "H" ...
## $ Order.Date : chr "7/27/2012" "9/14/2013" "5/15/2015" "5/17/2017" ...
## $ Order.ID : int 443368995 667593514 940995585 880811536 174590194 830192887 425793445 659878194 601245963 739008080 ...
## $ Ship.Date : chr "7/28/2012" "10/19/2013" "6/4/2015" "7/2/2017" ...
## $ Units.Sold : int 1593 4611 360 562 3973 1379 597 1476 896 7768 ...
## $ Unit.Price : num 9.33 109.28 421.89 109.28 47.45 ...
## $ Unit.Cost : num 6.92 35.84 364.69 35.84 31.79 ...
## $ Total.Revenue : num 14863 503890 151880 61415 188519 ...
## $ Total.Cost : num 11024 165258 131288 20142 126302 ...
## $ Total.Profit : num 3839 338632 20592 41273 62217 ...
library('data.table')
We will reproduce the same code used on the 1000-record dataset.
sum(is.na(data1000000R))
## [1] 0
data3 <- data.table( ItemType = data1000000R$Item.Type)
data3[,.(count = .N), by = ItemType][, percent := prop.table(count)*100][]
## ItemType count percent
## 1: Fruits 83551 8.3551
## 2: Clothes 83240 8.3240
## 3: Meat 83198 8.3198
## 4: Beverages 83326 8.3326
## 5: Office Supplies 83222 8.3222
## 6: Cosmetics 83431 8.3431
## 7: Snacks 83448 8.3448
## 8: Personal Care 83539 8.3539
## 9: Household 83267 8.3267
## 10: Vegetables 83170 8.3170
## 11: Baby Food 83397 8.3397
## 12: Cereal 83211 8.3211
data1000000R1 <- data1000000R %>%
filter(Item.Type == "Fruits") %>%
select(-c(Region,Order.Priority, Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))
data1000000R1 %>%
head(6)
## Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 1593 9.33 6.92 14862.69 11023.56 3839.13
## 2 1379 9.33 6.92 12866.07 9542.68 3323.39
## 3 8034 9.33 6.92 74957.22 55595.28 19361.94
## 4 5735 9.33 6.92 53507.55 39686.20 13821.35
## 5 9599 9.33 6.92 89558.67 66425.08 23133.59
## 6 3581 9.33 6.92 33410.73 24780.52 8630.21
partition2 <- createDataPartition(data1000000R1$Total.Profit, p = 0.70, list = FALSE)
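# Note: the two lines below index with `partition` (left over from the
# 1000-record analysis) instead of `partition2`, so train2s ends up with only
# 73 rows. The slip is kept visible, in the exploratory spirit of the exercise.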
train2s <- data1000000R1[partition,]
test2s <- data1000000R1[-partition,]
dim(train2s)
## [1] 73 6
dim(test2s)
## [1] 83478 6
# Fitting the model
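# Note: this model is fit on train1s (the Beverages training set from the
# 1000-record file) rather than train2s, so the Beverages margin of 15.66 is
# applied to Fruits, whose actual margin is 9.33 - 6.92 = 2.41. This explains
# the inflated predictions and the large MSE further down.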
model2s <- lm(Total.Profit~Units.Sold
+ Unit.Price+Unit.Cost+Total.Revenue
+Total.Cost, data = train1s)
summary(model2s)
## Warning in summary.lm(model2s): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Price + Unit.Cost +
## Total.Revenue + Total.Cost, data = train1s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.283e-11 -6.076e-12 -1.812e-12 2.609e-12 1.079e-10
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.725e-11 3.371e-12 -8.083e+00 1.17e-11 ***
## Units.Sold 1.566e+01 5.796e-16 2.702e+16 < 2e-16 ***
## Unit.Price NA NA NA NA
## Unit.Cost NA NA NA NA
## Total.Revenue NA NA NA NA
## Total.Cost NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.557e-11 on 71 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.3e+32 on 1 and 71 DF, p-value: < 2.2e-16
plot(model2s, which = 2)
pred2 <- predict(model2s, newdata = test2s)
## Warning in predict.lm(model2s, newdata = test2s): prediction from a rank-
## deficient fit may be misleading
check2 <- data.frame(test2s$Total.Profit, pred2, residuals = test2s$Total.Profit - pred2)
check2 %>%
head(8)
## test2s.Total.Profit pred2 residuals
## 6 8630.21 56078.46 -47448.25
## 7 23827.67 154830.42 -131002.75
## 9 3747.55 24351.30 -20603.75
## 13 11423.40 74228.40 -62805.00
## 17 7389.06 48013.56 -40624.50
## 19 15978.30 103825.80 -87847.50
## 31 13170.65 85581.90 -72411.25
## 32 3284.83 21344.58 -18059.75
MSE <- mean((test2s$Total.Profit - pred2)^2)
print(MSE)
## [1] 5841777225
test2s$Predicted.TotalProfit <- predict(model2s,test2s)
## Warning in predict.lm(model2s, test2s): prediction from a rank-deficient fit may
## be misleading
actual_pred2 <- data.frame(test2s$Total.Profit, test2s$Predicted.TotalProfit)
names(actual_pred2) <- c("Actual.Total.Profit", "Predicted.Total.Profit" )
accuracy2 <- cor(actual_pred2)
accuracy2
## Actual.Total.Profit Predicted.Total.Profit
## Actual.Total.Profit 1 1
## Predicted.Total.Profit 1 1
head(actual_pred2)
## Actual.Total.Profit Predicted.Total.Profit
## 1 8630.21 56078.46
## 2 23827.67 154830.42
## 3 3747.55 24351.30
## 4 11423.40 74228.40
## 5 7389.06 48013.56
## 6 15978.30 103825.80
test2s$Predicted.TotalProfit <- predict(model2s,test2s)
## Warning in predict.lm(model2s, test2s): prediction from a rank-deficient fit may
## be misleading
plot2s <- test2s %>%
  ggplot(aes(Total.Profit, Predicted.TotalProfit)) +
  geom_point(alpha = 0.5) +
  # colour set outside aes(), as above, so it is applied literally
  stat_smooth(colour = 'black') +
  xlab('Actual value of Total Profit for Fruits') +
  ylab('Predicted value of Total Profit for Fruits') +
  theme_bw()
ggplotly(plot2s)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The correlation between actual and predicted values is again exactly 1, as with the 1000-record file, but the result is otherwise not the same: every prediction is inflated by a constant factor of about 6.5 (the Beverages margin 15.66 divided by the Fruits margin 2.41), which is why the MSE jumps from essentially zero to about 5.8 billion. The correlation stays perfect only because both the actual and the predicted profit are exact linear functions of Units.Sold.
Another problem we could articulate on this dataset is: predict whether the next customer purchase will be a popular item or not. To do this, we would have to generate a new variable called “Popularity”, taking the value 1/0 or yes/no, based on certain criteria that we wish had been given. Instead, we can try to determine the priority of the next customer order, and we can use a decision tree to predict it.
data3 <- data.table( OrderPriority = data1000R$Order.Priority)
data3[,.(count = .N), by = OrderPriority][, percent := prop.table(count)*100][]
## OrderPriority count percent
## 1: M 242 24.2
## 2: C 262 26.2
## 3: H 228 22.8
## 4: L 268 26.8
#data1000R2 <- data1000R %>%
# group_by(Order.Priority)
Looking at the features within the dataset, it is a bit hard to say whether variables like Region, Country, Sales.Channel, Item.Type, Order.ID, and Order.Date matter, because there is no real trend or logic suggesting that any of them influences the order priority. We assume that the variables we used for the regression analysis also have influence on the order priority.
data1000R$Order.Priority <- as.factor(data1000R$Order.Priority)
#str(data1000R2)
data1000R2 <- data1000R %>%
select(-c(Region,Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))
data1000R2 %>%
head(6)
## Order.Priority Units.Sold Unit.Price Unit.Cost Total.Revenue Total.Cost
## 1 M 8446 437.20 263.33 3692591.20 2224085.2
## 2 M 3018 154.06 90.93 464953.08 274426.7
## 3 C 1517 255.28 159.42 387259.76 241840.1
## 4 C 3322 205.70 117.11 683335.40 389039.4
## 5 H 9845 9.33 6.92 91853.85 68127.4
## 6 H 9528 205.70 117.11 1959909.60 1115824.1
## Total.Profit
## 1 1468506.02
## 2 190526.34
## 3 145419.62
## 4 294295.98
## 5 23726.45
## 6 844085.52
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
##
## Attaching package: 'modeltools'
## The following object is masked from 'package:arules':
##
## info
## The following object is masked from 'package:plyr':
##
## empty
## The following object is masked from 'package:BayesFactor':
##
## posterior
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
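# As before, sample.split() is given a whole data frame, so the returned
# logical vector has length ncol(data1000R2) = 7 and is recycled across rows.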
data4 = sample.split(data1000R2, SplitRatio = 0.70)
train4 <- subset(data1000R2, data4 == TRUE)
test4 <- subset(data1000R2, data4 == FALSE)
model4 <- ctree(Order.Priority ~ ., train4)
plot(model4)
pred4 <- predict(model4, test4)
classifier <- table(test4$Order.Priority, pred4)
classifier
## pred4
## C H L M
## C 0 0 119 0
## H 0 0 95 0
## L 0 0 114 0
## M 0 0 101 0
The confusion matrix shows that model4 predicted every order in the test set as priority “L”; it never predicted C, H, or M.
accuracy4 <- sum(diag(classifier))/sum(classifier)
accuracy4
## [1] 0.2657343
The model4 accuracy is about 26.6%, which is low: barely above the 25% expected from guessing at random among four classes, and it comes entirely from the test orders that happen to be priority “L”.
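For context, a no-information baseline that always predicts the most frequent class in the test set (here “C”, 119 of 429 observations, about 27.7%) would actually edge out the tree; a sketch:
# Baseline sketch: accuracy from always predicting the most frequent class
# observed in the test set.
max(table(test4$Order.Priority)) / nrow(test4)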
data1000000R$Order.Priority <- as.factor(data1000000R$Order.Priority)
#str(data1000R2)
data1000000R2 <- data1000000R %>%
select(-c(Region,Order.Date, Order.ID, Ship.Date, Item.Type, Sales.Channel,Country))
data5 = sample.split(data1000000R2, SplitRatio = 0.70)
train5 <- subset(data1000000R2, data5 == TRUE)
test5 <- subset(data1000000R2, data5 == FALSE)
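# Note: ctree() is fit on train4 (from the 1000-record data) rather than
# train5; the same kind of slip as in the regression section. The predictions
# on test5 below therefore come from the small-data tree.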
model5 <- ctree(Order.Priority ~ ., train4)
summary(model5)
## Length Class Mode
## 1 BinaryTree S4
plot(model5)
pred5 <- predict(model5, test5)
classifier2 <- table(test5$Order.Priority, pred5)
classifier2
## pred5
## C H L M
## C 0 0 107226 0
## H 0 0 106780 0
## L 0 0 107478 0
## M 0 0 107088 0
accuracy5 <- sum(diag(classifier2))/sum(classifier2)
accuracy5
## [1] 0.2507817
We got essentially the same result with one million records as with 1000 records: the tree again predicts a single class, and its accuracy (about 25.1%) sits at the chance level for four classes. In conclusion, increasing the number of records for this dataset did not improve the performance of the two machine learning algorithms (multiple linear regression and decision tree); the selected financial features carry no real signal about order priority, and in both large-data runs the model was in fact fit on the small-data training split, so more data could not have helped.