library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(randomForest)
library(caTools)
library(nnet)
library(caret)
For this assignment I have chosen to work with the Sales dataset. I am working with the 100 records dataset and 5000 records dataset. The Sales dataset is randomly generated and can be found at https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. Each row contains metadata for a unique sales including unique identification number, total cost, item type, etc.
small_sales <- read.csv('100_Sales.csv')
dim(small_sales)
## [1] 100 14
large_sales <- read.csv('5000_Sales.csv')
dim(large_sales)
## [1] 5000 14
The dataset contains the following columns:
names <- data.frame(Small = colnames(small_sales), Large = colnames(large_sales))
names
## Small Large
## 1 Region Region
## 2 Country Country
## 3 Item.Type Item.Type
## 4 Sales.Channel Sales.Channel
## 5 Order.Priority Order.Priority
## 6 Order.Date Order.Date
## 7 Order.ID Order.ID
## 8 Ship.Date Ship.Date
## 9 Units.Sold Units.Sold
## 10 Unit.Price Unit.Price
## 11 Unit.Cost Unit.Cost
## 12 Total.Revenue Total.Revenue
## 13 Total.Cost Total.Cost
## 14 Total.Profit Total.Profit
## Exploratary Analysis
As the first step of exploratory analysis we will take a glimpse at the data.
head(small_sales,3)
## Region Country Item.Type Sales.Channel
## 1 Australia and Oceania Tuvalu Baby Food Offline
## 2 Central America and the Caribbean Grenada Cereal Online
## 3 Europe Russia Office Supplies Offline
## Order.Priority Order.Date Order.ID Ship.Date Units.Sold Unit.Price Unit.Cost
## 1 H 5/28/2010 669165933 6/27/2010 9925 255.28 159.42
## 2 C 8/22/2012 963881480 9/15/2012 2804 205.70 117.11
## 3 L 5/2/2014 341417157 5/8/2014 1779 651.21 524.96
## Total.Revenue Total.Cost Total.Profit
## 1 2533654.0 1582243.5 951410.5
## 2 576782.8 328376.4 248406.4
## 3 1158502.6 933903.8 224598.8
head(large_sales,3)
## Region Country Item.Type
## 1 Central America and the Caribbean Antigua and Barbuda Baby Food
## 2 Central America and the Caribbean Panama Snacks
## 3 Europe Czech Republic Beverages
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Online M 12/20/2013 957081544 1/11/2014 552
## 2 Offline C 7/5/2010 301644504 7/26/2010 2167
## 3 Offline C 9/12/2011 478051030 9/29/2011 4778
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 255.28 159.42 140914.6 87999.84 52914.72
## 2 152.58 97.44 330640.9 211152.48 119488.38
## 3 47.45 31.79 226716.1 151892.62 74823.48
Looking at the datatypes of the columns, some necessary changes can be seen. For easier analysis, we will convert the ‘Order.Date’ and ‘Ship.Date’ columns from character to Date and the Order.ID column from an int to character. For other categorical columns we will covert them to the factor datatype.
small_sales$Order.Date <- as.Date(small_sales[['Order.Date']], "%m/%d/%Y")
small_sales$Ship.Date <- as.Date(small_sales[['Ship.Date']], "%m/%d/%Y")
small_sales$Order.ID <- toString(small_sales[['Order.ID']])
small_sales$Region <- as.factor(small_sales[['Region']])
small_sales$Country <- as.factor(small_sales[['Country']])
small_sales$Item.Type <- as.factor(small_sales[['Item.Type']])
small_sales$Sales.Channel <- as.factor(small_sales[['Sales.Channel']])
small_sales$Order.Priority <- as.factor(small_sales[['Order.Priority']])
large_sales$Order.Date <- as.Date(large_sales[['Order.Date']], "%m/%d/%Y")
large_sales$Ship.Date <- as.Date(large_sales[['Ship.Date']], "%m/%d/%Y")
large_sales$Order.ID <- toString(large_sales[['Order.ID']])
large_sales$Region <- as.factor(large_sales[['Region']])
large_sales$Country <- as.factor(large_sales[['Country']])
large_sales$Item.Type <- as.factor(large_sales[['Item.Type']])
large_sales$Sales.Channel <- as.factor(large_sales[['Sales.Channel']])
large_sales$Order.Priority <- as.factor(large_sales[['Order.Priority']])
In order to make sure we have complete data we will do a check for any missing values.
sum(is.na(small_sales))
## [1] 0
sum(is.na(large_sales))
## [1] 0
Now that we have confirmed that no missing values exist in both datasets and have properly converted the data columns, we can start with the statistical and visual summaries.
###Summary
summary(select_if(small_sales, is.numeric))
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 124 Min. : 9.33 Min. : 6.92 Min. : 4870
## 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 268721
## Median :5382 Median :179.88 Median :107.28 Median : 752314
## Mean :5129 Mean :276.76 Mean :191.05 Mean :1373488
## 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:2212045
## Max. :9925 Max. :668.27 Max. :524.96 Max. :5997055
## Total.Cost Total.Profit
## Min. : 3612 Min. : 1258
## 1st Qu.: 168868 1st Qu.: 121444
## Median : 363566 Median : 290768
## Mean : 931806 Mean : 441682
## 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :4509794 Max. :1719922
num1 <- ggplot(data = stack(small_sales %>% select(Total.Revenue,Total.Cost,Total.Profit)), aes(x = ind, y = values)) + geom_boxplot(color="red", fill="orange", alpha=0.2) + theme_classic()
num2 <- ggplot(data = stack(small_sales %>% select(Units.Sold,Unit.Price,Unit.Cost)), aes(x = ind, y = values)) + geom_boxplot(color="red", fill="orange", alpha=0.2) + theme_classic()
plot_grid(num1, num2, labels = c('Total', 'Units'))
as.data.frame(table(small_sales$Country))
## Var1 Freq
## 1 Albania 1
## 2 Angola 1
## 3 Australia 3
## 4 Austria 1
## 5 Azerbaijan 2
## 6 Bangladesh 1
## 7 Belize 1
## 8 Brunei 1
## 9 Bulgaria 2
## 10 Burkina Faso 1
## 11 Cameroon 2
## 12 Cape Verde 1
## 13 Comoros 1
## 14 Costa Rica 1
## 15 Cote d'Ivoire 1
## 16 Democratic Republic of the Congo 1
## 17 Djibouti 3
## 18 East Timor 1
## 19 Federated States of Micronesia 1
## 20 Fiji 1
## 21 France 1
## 22 Gabon 1
## 23 Grenada 1
## 24 Haiti 1
## 25 Honduras 2
## 26 Iceland 1
## 27 Iran 1
## 28 Kenya 1
## 29 Kiribati 1
## 30 Kuwait 1
## 31 Kyrgyzstan 1
## 32 Laos 1
## 33 Lebanon 1
## 34 Lesotho 1
## 35 Libya 2
## 36 Lithuania 1
## 37 Macedonia 1
## 38 Madagascar 1
## 39 Malaysia 1
## 40 Mali 2
## 41 Mauritania 1
## 42 Mexico 3
## 43 Moldova 1
## 44 Monaco 1
## 45 Mongolia 1
## 46 Mozambique 1
## 47 Myanmar 2
## 48 New Zealand 1
## 49 Nicaragua 1
## 50 Niger 1
## 51 Norway 2
## 52 Pakistan 1
## 53 Portugal 1
## 54 Republic of the Congo 1
## 55 Romania 1
## 56 Russia 1
## 57 Rwanda 2
## 58 Samoa 1
## 59 San Marino 1
## 60 Sao Tome and Principe 3
## 61 Saudi Arabia 1
## 62 Senegal 1
## 63 Sierra Leone 3
## 64 Slovakia 1
## 65 Slovenia 1
## 66 Solomon Islands 1
## 67 South Sudan 1
## 68 Spain 1
## 69 Sri Lanka 1
## 70 Switzerland 2
## 71 Syria 1
## 72 The Gambia 4
## 73 Turkmenistan 2
## 74 Tuvalu 1
## 75 United Kingdom 1
## 76 Zambia 1
region <- ggplot(small_sales, aes(x=Region)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
type <- ggplot(small_sales, aes(x=Item.Type)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
channel <- ggplot(small_sales, aes(x=Sales.Channel)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
priority <- ggplot(small_sales, aes(x=Order.Priority)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
plot_grid(region, type, channel, priority, labels = c( 'Region','Type', 'Channel', 'Priority'))
corrplot(cor(dplyr::select_if(small_sales, is.numeric)),
method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5, col=colorRampPalette(c("blue","white","red"))(200))
summary(select_if(large_sales, is.numeric))
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 65
## 1st Qu.:2453 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 257417
## Median :5123 Median :154.06 Median : 97.44 Median : 779409
## Mean :5031 Mean :265.75 Mean :187.49 Mean :1325738
## 3rd Qu.:7576 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:1839975
## Max. :9999 Max. :668.27 Max. :524.96 Max. :6672676
## Total.Cost Total.Profit
## Min. : 48 Min. : 16.9
## 1st Qu.: 154748 1st Qu.: 85339.3
## Median : 468181 Median : 279095.2
## Mean : 933093 Mean : 392644.6
## 3rd Qu.:1189578 3rd Qu.: 565106.4
## Max. :5248025 Max. :1726007.5
num1 <- ggplot(data = stack(large_sales %>% select(Total.Revenue,Total.Cost,Total.Profit)), aes(x = ind, y = values)) + geom_boxplot(color="darkgreen", fill="lightgreen", alpha=0.2) + theme_classic()
num2 <- ggplot(data = stack(large_sales %>% select(Units.Sold,Unit.Price,Unit.Cost)), aes(x = ind, y = values)) + geom_boxplot(color="darkgreen", fill="lightgreen", alpha=0.2) + theme_classic()
plot_grid(num1, num2, labels = c('Total', 'Units'))
as.data.frame(table(large_sales$Country))
## Var1 Freq
## 1 Afghanistan 19
## 2 Albania 21
## 3 Algeria 28
## 4 Andorra 40
## 5 Angola 16
## 6 Antigua and Barbuda 26
## 7 Armenia 29
## 8 Australia 34
## 9 Austria 28
## 10 Azerbaijan 27
## 11 Bahrain 18
## 12 Bangladesh 22
## 13 Barbados 26
## 14 Belarus 28
## 15 Belgium 26
## 16 Belize 20
## 17 Benin 27
## 18 Bhutan 30
## 19 Bosnia and Herzegovina 33
## 20 Botswana 31
## 21 Brunei 34
## 22 Bulgaria 30
## 23 Burkina Faso 26
## 24 Burundi 26
## 25 Cambodia 26
## 26 Cameroon 31
## 27 Canada 19
## 28 Cape Verde 14
## 29 Central African Republic 26
## 30 Chad 27
## 31 China 27
## 32 Comoros 25
## 33 Costa Rica 26
## 34 Cote d'Ivoire 32
## 35 Croatia 18
## 36 Cuba 28
## 37 Cyprus 30
## 38 Czech Republic 31
## 39 Democratic Republic of the Congo 23
## 40 Denmark 26
## 41 Djibouti 26
## 42 Dominica 18
## 43 Dominican Republic 29
## 44 East Timor 27
## 45 Egypt 25
## 46 El Salvador 24
## 47 Equatorial Guinea 33
## 48 Eritrea 19
## 49 Estonia 23
## 50 Ethiopia 26
## 51 Federated States of Micronesia 20
## 52 Fiji 28
## 53 Finland 23
## 54 France 26
## 55 Gabon 22
## 56 Georgia 23
## 57 Germany 26
## 58 Ghana 38
## 59 Greece 32
## 60 Greenland 26
## 61 Grenada 36
## 62 Guatemala 28
## 63 Guinea 33
## 64 Guinea-Bissau 29
## 65 Haiti 36
## 66 Honduras 33
## 67 Hungary 25
## 68 Iceland 23
## 69 India 25
## 70 Indonesia 28
## 71 Iran 29
## 72 Iraq 29
## 73 Ireland 30
## 74 Israel 23
## 75 Italy 30
## 76 Jamaica 27
## 77 Japan 21
## 78 Jordan 31
## 79 Kazakhstan 24
## 80 Kenya 28
## 81 Kiribati 25
## 82 Kosovo 35
## 83 Kuwait 26
## 84 Kyrgyzstan 29
## 85 Laos 34
## 86 Latvia 27
## 87 Lebanon 31
## 88 Lesotho 23
## 89 Liberia 27
## 90 Libya 32
## 91 Liechtenstein 24
## 92 Lithuania 25
## 93 Luxembourg 28
## 94 Macedonia 32
## 95 Madagascar 24
## 96 Malawi 24
## 97 Malaysia 27
## 98 Maldives 20
## 99 Mali 33
## 100 Malta 32
## 101 Marshall Islands 26
## 102 Mauritania 29
## 103 Mauritius 38
## 104 Mexico 23
## 105 Moldova 24
## 106 Monaco 13
## 107 Mongolia 15
## 108 Montenegro 28
## 109 Morocco 21
## 110 Mozambique 25
## 111 Myanmar 33
## 112 Namibia 21
## 113 Nauru 25
## 114 Nepal 28
## 115 Netherlands 24
## 116 New Zealand 26
## 117 Nicaragua 20
## 118 Niger 35
## 119 Nigeria 23
## 120 North Korea 24
## 121 Norway 30
## 122 Oman 31
## 123 Pakistan 27
## 124 Palau 25
## 125 Panama 32
## 126 Papua New Guinea 20
## 127 Philippines 24
## 128 Poland 28
## 129 Portugal 25
## 130 Qatar 32
## 131 Republic of the Congo 31
## 132 Romania 34
## 133 Russia 32
## 134 Rwanda 30
## 135 Saint Kitts and Nevis 32
## 136 Saint Lucia 29
## 137 Saint Vincent and the Grenadines 22
## 138 Samoa 35
## 139 San Marino 40
## 140 Sao Tome and Principe 24
## 141 Saudi Arabia 23
## 142 Senegal 30
## 143 Serbia 29
## 144 Seychelles 30
## 145 Sierra Leone 23
## 146 Singapore 23
## 147 Slovakia 30
## 148 Slovenia 27
## 149 Solomon Islands 31
## 150 Somalia 36
## 151 South Africa 23
## 152 South Korea 36
## 153 South Sudan 25
## 154 Spain 26
## 155 Sri Lanka 31
## 156 Sudan 34
## 157 Swaziland 25
## 158 Sweden 29
## 159 Switzerland 28
## 160 Syria 19
## 161 Taiwan 31
## 162 Tajikistan 25
## 163 Tanzania 27
## 164 Thailand 25
## 165 The Bahamas 23
## 166 The Gambia 19
## 167 Togo 30
## 168 Tonga 37
## 169 Trinidad and Tobago 19
## 170 Tunisia 31
## 171 Turkey 25
## 172 Turkmenistan 31
## 173 Tuvalu 22
## 174 Uganda 22
## 175 Ukraine 33
## 176 United Arab Emirates 25
## 177 United Kingdom 23
## 178 United States of America 38
## 179 Uzbekistan 27
## 180 Vanuatu 35
## 181 Vatican City 23
## 182 Vietnam 19
## 183 Yemen 22
## 184 Zambia 31
## 185 Zimbabwe 21
region <- ggplot(large_sales, aes(x=Region)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
type <- ggplot(large_sales, aes(x=Item.Type)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
channel <- ggplot(large_sales, aes(x=Sales.Channel)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
priority <- ggplot(large_sales, aes(x=Order.Priority)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
plot_grid(region, type, channel, priority, labels = c( 'Region','Type', 'Channel', 'Priority'))
corrplot(cor(dplyr::select_if(large_sales, is.numeric)),
method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5, col=colorRampPalette(c("yellow","white","darkgreen"))(200))
set.seed(123)
train_ind <- sample(seq_len(nrow(small_sales)), size = floor(0.85 * nrow(small_sales)))
s_train <- small_sales[train_ind, ]
s_test <- small_sales[-train_ind, ]
train_ind <- sample(seq_len(nrow(large_sales)), size = floor(0.85 * nrow(large_sales)))
l_train <- large_sales[train_ind, ]
l_test <- large_sales[-train_ind, ]
small_lm <- lm(Total.Profit ~ Region + Item.Type + Sales.Channel + Order.Priority, data = s_train)
#summary(small_lm)
s_test$Predicted <- predict(small_lm, newdata = s_test)
# Model performance
# (a) Prediction error, RMSE
RMSE(s_test$Predicted, s_test$Total.Profit)
## [1] 307989.5
# (b) R-square
R2(s_test$Predicted, s_test$Total.Profit)
## [1] 0.1623618
large_lm <- lm(Total.Profit ~ Region + Item.Type + Sales.Channel + Order.Priority, data = l_train)
#summary(large_lm)
l_test$Predicted <- predict(large_lm, newdata = l_test)
# Model performance
# (a) Prediction error, RMSE
RMSE(l_test$Predicted, l_test$Total.Profit)
## [1] 262668.9
# (b) R-square
R2(l_test$Predicted, l_test$Total.Profit)
## [1] 0.4640496
set.seed(123)
train_ind <- sample(seq_len(nrow(small_sales)), size = floor(0.85 * nrow(small_sales)))
s_train <- small_sales[train_ind, ]
s_test <- small_sales[-train_ind, ]
train_ind <- sample(seq_len(nrow(large_sales)), size = floor(0.85 * nrow(large_sales)))
l_train <- large_sales[train_ind, ]
l_test <- large_sales[-train_ind, ]
small_mlm <- multinom(Order.Priority ~ Region + Item.Type +
Units.Sold + Unit.Price +
Unit.Cost + Total.Cost +
Total.Profit + Total.Revenue,data=s_train)
## # weights: 100 (72 variable)
## initial value 117.835021
## iter 10 value 111.239071
## iter 20 value 80.808460
## iter 30 value 75.873256
## iter 40 value 75.490182
## iter 50 value 75.465676
## iter 60 value 75.460549
## iter 70 value 75.460331
## final value 75.460323
## converged
#summary(small_mlm)
# Prediction
s_test$Predicted <- predict(small_mlm, newdata = s_test, "class")
tab <- table(s_test$Order.Priority, s_test$Predicted)
tab
##
## C H L M
## C 0 3 0 0
## H 2 3 1 0
## L 0 0 3 0
## M 0 2 1 0
# Accuracy
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 40
large_mlm <- multinom(Order.Priority ~ Region + Item.Type +
Units.Sold + Unit.Price +
Unit.Cost + Total.Cost +
Total.Profit + Total.Revenue,data=l_train)
## # weights: 100 (72 variable)
## initial value 5891.751035
## iter 10 value 5886.284972
## iter 20 value 5866.867730
## iter 30 value 5860.619005
## iter 40 value 5858.992679
## final value 5858.942520
## converged
#summary(large_mlm)
# Prediction
l_test$Predicted <- predict(large_mlm, newdata = l_test, "class")
tab2 <- table(l_test$Order.Priority, l_test$Predicted)
tab2
##
## C H L M
## C 28 29 37 72
## H 30 23 57 84
## L 24 31 53 76
## M 32 37 45 92
# Accuracy
round((sum(diag(tab2))/sum(tab2))*100,2)
## [1] 26.13
This machine learning algorithm analysis works with the Sales dataset. This dataset is randomly generated and can be found at https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. This analysis requires two opposing sizes of data and for that I have chosen the 100 records and 5000 records set respectively.
If we take a look at the dataset we can see that it contains 14 columns and 100 and 5000 rows respectively. The data contains information on sales transactions including associate dates, items, costs, and location. Of the fourteen columns we have two date columns, five categorical columns, six numeric columns, and one identification column.
To start off, the analysis data columns were converted to better suited data types, and checked for any null values. Each dataset did not include any missing values. In order to see the spread of the data visual and statistical summaries were performed. If we look at the box plots of the numeric columns we can see that each column is right skewed across both datasets. The categorical data is also similar across both datasets.
These datasets are very practical for machine learning analysis as they are similar to real everyday data that would require analysis. Businesses can use such data for forecasting futures sales and predicting consumer behavior. For this analysis I have decided to implement two machine learning algorithms: Multiple Linear Regression and Multinomial Logistic Regression. Each implementation has everyday uses.
Multiple Linear Regression is a regression model that estimates the relationship between a continuous dependent variable and two or more independent variables. It is a supervised machine learning algorithm and aims to find the line of best fit between the dependent and independent variables. For this implementation we are aiming to model the relationship between the Total.Profit variable and the set of independent variables: Region, Item.Type, Sales.Channel, and Order.Priority. For each dataset, the data was split into training and testing subsets based on a 85:15 ratio. The model was trained on the training data and predictions were made using the testing data. The RMSE and R-squared values were used to evaluate the model. The model for the small data set had a RMSE of 307989.5 and a R-squared of 0.1623618. The model for the large data set had a RMSE of 262668.9 and a R-squared of 0.4640496. For this algorithm the dataset with the larger records did better with a lower RMSE and R-squared. Neither of the models gave a relatively high R-squared, but seeing that the larger dataset performed better, it would be reasonable to predict that an even larger dataset would yield better results.
Multinomial Logistic Regression is a machine learning algorithm used to predict the categorical values for a dependent variable. It is a type of logistic regression where the dependent variable has more than two categories. For this implementation we are aiming to predict the Order.Priority value based on the following columns: Region, Item.Type, Units.Sold, Unit.Price, Unit.Cost, Total.Cost, Total.Profit, and Total.Revenue. The data is split into training and testing data based on a 85:15 ratio. The model is trained on the training data and predictions were made on the testing data. The model with the smaller data had a 40% accuracy in its predictions. The model with the larger data had a 26.13% accuracy in its predictions. The models seem not to work well with the larger dataset. We can test this theory in the future by using an even larger dataset.
Each algorithm has its own pros and cons. MLM seems to work better with data, while MLR works better with smaller data. Both algorithms were chosen because of the practical solutions they provide for everyday problems. While simple, these models provide a rich analysis and insight for future decisions.