Data 622 Homework 1

library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(randomForest)
library(caTools)
library(nnet)
library(caret)

Data Structure and Content

For this assignment I have chosen to work with the Sales dataset. I am working with the 100 records dataset and 5000 records dataset. The Sales dataset is randomly generated and can be found at https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. Each row contains metadata for a unique sales including unique identification number, total cost, item type, etc.

small_sales <- read.csv('100_Sales.csv')
dim(small_sales)

## [1] 100  14

large_sales <- read.csv('5000_Sales.csv')
dim(large_sales)

## [1] 5000   14

The dataset contains the following columns:

names <- data.frame(Small = colnames(small_sales), Large = colnames(large_sales))
names

##             Small          Large
## 1          Region         Region
## 2         Country        Country
## 3       Item.Type      Item.Type
## 4   Sales.Channel  Sales.Channel
## 5  Order.Priority Order.Priority
## 6      Order.Date     Order.Date
## 7        Order.ID       Order.ID
## 8       Ship.Date      Ship.Date
## 9      Units.Sold     Units.Sold
## 10     Unit.Price     Unit.Price
## 11      Unit.Cost      Unit.Cost
## 12  Total.Revenue  Total.Revenue
## 13     Total.Cost     Total.Cost
## 14   Total.Profit   Total.Profit

## Exploratary Analysis

Glimpse

As the first step of exploratory analysis we will take a glimpse at the data.

head(small_sales,3)

##                              Region Country       Item.Type Sales.Channel
## 1             Australia and Oceania  Tuvalu       Baby Food       Offline
## 2 Central America and the Caribbean Grenada          Cereal        Online
## 3                            Europe  Russia Office Supplies       Offline
##   Order.Priority Order.Date  Order.ID Ship.Date Units.Sold Unit.Price Unit.Cost
## 1              H  5/28/2010 669165933 6/27/2010       9925     255.28    159.42
## 2              C  8/22/2012 963881480 9/15/2012       2804     205.70    117.11
## 3              L   5/2/2014 341417157  5/8/2014       1779     651.21    524.96
##   Total.Revenue Total.Cost Total.Profit
## 1     2533654.0  1582243.5     951410.5
## 2      576782.8   328376.4     248406.4
## 3     1158502.6   933903.8     224598.8

head(large_sales,3)

##                              Region              Country Item.Type
## 1 Central America and the Caribbean Antigua and Barbuda  Baby Food
## 2 Central America and the Caribbean               Panama    Snacks
## 3                            Europe       Czech Republic Beverages
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1        Online              M 12/20/2013 957081544 1/11/2014        552
## 2       Offline              C   7/5/2010 301644504 7/26/2010       2167
## 3       Offline              C  9/12/2011 478051030 9/29/2011       4778
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     255.28    159.42      140914.6   87999.84     52914.72
## 2     152.58     97.44      330640.9  211152.48    119488.38
## 3      47.45     31.79      226716.1  151892.62     74823.48

Convert Column Datatype

Looking at the datatypes of the columns, some necessary changes can be seen. For easier analysis, we will convert the ‘Order.Date’ and ‘Ship.Date’ columns from character to Date and the Order.ID column from an int to character. For other categorical columns we will covert them to the factor datatype.

small_sales$Order.Date <- as.Date(small_sales[['Order.Date']], "%m/%d/%Y")
small_sales$Ship.Date <- as.Date(small_sales[['Ship.Date']], "%m/%d/%Y")
small_sales$Order.ID <- toString(small_sales[['Order.ID']])
small_sales$Region <- as.factor(small_sales[['Region']])
small_sales$Country <- as.factor(small_sales[['Country']])
small_sales$Item.Type <- as.factor(small_sales[['Item.Type']])
small_sales$Sales.Channel <- as.factor(small_sales[['Sales.Channel']])
small_sales$Order.Priority <- as.factor(small_sales[['Order.Priority']])

large_sales$Order.Date <- as.Date(large_sales[['Order.Date']], "%m/%d/%Y")
large_sales$Ship.Date <- as.Date(large_sales[['Ship.Date']], "%m/%d/%Y")
large_sales$Order.ID <- toString(large_sales[['Order.ID']])
large_sales$Region <- as.factor(large_sales[['Region']])
large_sales$Country <- as.factor(large_sales[['Country']])
large_sales$Item.Type <- as.factor(large_sales[['Item.Type']])
large_sales$Sales.Channel <- as.factor(large_sales[['Sales.Channel']])
large_sales$Order.Priority <- as.factor(large_sales[['Order.Priority']])

NA

In order to make sure we have complete data we will do a check for any missing values.

sum(is.na(small_sales))

## [1] 0

sum(is.na(large_sales))

## [1] 0

Now that we have confirmed that no missing values exist in both datasets and have properly converted the data columns, we can start with the statistical and visual summaries.

###Summary

Small Dataset

summary(select_if(small_sales, is.numeric))

##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit    
##  Min.   :   3612   Min.   :   1258  
##  1st Qu.: 168868   1st Qu.: 121444  
##  Median : 363566   Median : 290768  
##  Mean   : 931806   Mean   : 441682  
##  3rd Qu.:1613870   3rd Qu.: 635829  
##  Max.   :4509794   Max.   :1719922

num1 <- ggplot(data = stack(small_sales %>% select(Total.Revenue,Total.Cost,Total.Profit)), aes(x = ind, y = values)) + geom_boxplot(color="red", fill="orange", alpha=0.2) + theme_classic()
num2 <- ggplot(data = stack(small_sales %>% select(Units.Sold,Unit.Price,Unit.Cost)), aes(x = ind, y = values)) + geom_boxplot(color="red", fill="orange", alpha=0.2) + theme_classic()
plot_grid(num1, num2, labels = c('Total', 'Units'))

as.data.frame(table(small_sales$Country))

##                                Var1 Freq
## 1                           Albania    1
## 2                            Angola    1
## 3                         Australia    3
## 4                           Austria    1
## 5                        Azerbaijan    2
## 6                        Bangladesh    1
## 7                            Belize    1
## 8                            Brunei    1
## 9                          Bulgaria    2
## 10                     Burkina Faso    1
## 11                         Cameroon    2
## 12                       Cape Verde    1
## 13                          Comoros    1
## 14                       Costa Rica    1
## 15                    Cote d'Ivoire    1
## 16 Democratic Republic of the Congo    1
## 17                         Djibouti    3
## 18                       East Timor    1
## 19   Federated States of Micronesia    1
## 20                             Fiji    1
## 21                           France    1
## 22                            Gabon    1
## 23                          Grenada    1
## 24                            Haiti    1
## 25                         Honduras    2
## 26                          Iceland    1
## 27                             Iran    1
## 28                            Kenya    1
## 29                         Kiribati    1
## 30                           Kuwait    1
## 31                       Kyrgyzstan    1
## 32                             Laos    1
## 33                          Lebanon    1
## 34                          Lesotho    1
## 35                            Libya    2
## 36                        Lithuania    1
## 37                        Macedonia    1
## 38                       Madagascar    1
## 39                         Malaysia    1
## 40                             Mali    2
## 41                       Mauritania    1
## 42                           Mexico    3
## 43                         Moldova     1
## 44                           Monaco    1
## 45                         Mongolia    1
## 46                       Mozambique    1
## 47                          Myanmar    2
## 48                      New Zealand    1
## 49                        Nicaragua    1
## 50                            Niger    1
## 51                           Norway    2
## 52                         Pakistan    1
## 53                         Portugal    1
## 54            Republic of the Congo    1
## 55                          Romania    1
## 56                           Russia    1
## 57                           Rwanda    2
## 58                           Samoa     1
## 59                       San Marino    1
## 60            Sao Tome and Principe    3
## 61                     Saudi Arabia    1
## 62                          Senegal    1
## 63                     Sierra Leone    3
## 64                         Slovakia    1
## 65                         Slovenia    1
## 66                  Solomon Islands    1
## 67                      South Sudan    1
## 68                            Spain    1
## 69                        Sri Lanka    1
## 70                      Switzerland    2
## 71                            Syria    1
## 72                       The Gambia    4
## 73                     Turkmenistan    2
## 74                           Tuvalu    1
## 75                   United Kingdom    1
## 76                           Zambia    1

region <- ggplot(small_sales, aes(x=Region)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
type <- ggplot(small_sales, aes(x=Item.Type)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
channel <- ggplot(small_sales, aes(x=Sales.Channel)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
priority <- ggplot(small_sales, aes(x=Order.Priority)) + geom_bar(color="red", fill="orange", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

plot_grid(region, type, channel, priority, labels = c( 'Region','Type', 'Channel', 'Priority'))

corrplot(cor(dplyr::select_if(small_sales, is.numeric)),
         method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5,  col=colorRampPalette(c("blue","white","red"))(200))

Large Dataset

summary(select_if(large_sales, is.numeric))

##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   :   2   Min.   :  9.33   Min.   :  6.92   Min.   :     65  
##  1st Qu.:2453   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 257417  
##  Median :5123   Median :154.06   Median : 97.44   Median : 779409  
##  Mean   :5031   Mean   :265.75   Mean   :187.49   Mean   :1325738  
##  3rd Qu.:7576   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:1839975  
##  Max.   :9999   Max.   :668.27   Max.   :524.96   Max.   :6672676  
##    Total.Cost       Total.Profit      
##  Min.   :     48   Min.   :     16.9  
##  1st Qu.: 154748   1st Qu.:  85339.3  
##  Median : 468181   Median : 279095.2  
##  Mean   : 933093   Mean   : 392644.6  
##  3rd Qu.:1189578   3rd Qu.: 565106.4  
##  Max.   :5248025   Max.   :1726007.5

num1 <- ggplot(data = stack(large_sales %>% select(Total.Revenue,Total.Cost,Total.Profit)), aes(x = ind, y = values)) + geom_boxplot(color="darkgreen", fill="lightgreen", alpha=0.2) + theme_classic()
num2 <- ggplot(data = stack(large_sales %>% select(Units.Sold,Unit.Price,Unit.Cost)), aes(x = ind, y = values)) + geom_boxplot(color="darkgreen", fill="lightgreen", alpha=0.2) + theme_classic()
plot_grid(num1, num2, labels = c('Total', 'Units'))

as.data.frame(table(large_sales$Country))

##                                 Var1 Freq
## 1                        Afghanistan   19
## 2                            Albania   21
## 3                            Algeria   28
## 4                            Andorra   40
## 5                             Angola   16
## 6               Antigua and Barbuda    26
## 7                            Armenia   29
## 8                          Australia   34
## 9                            Austria   28
## 10                        Azerbaijan   27
## 11                           Bahrain   18
## 12                        Bangladesh   22
## 13                          Barbados   26
## 14                           Belarus   28
## 15                           Belgium   26
## 16                            Belize   20
## 17                             Benin   27
## 18                            Bhutan   30
## 19            Bosnia and Herzegovina   33
## 20                          Botswana   31
## 21                            Brunei   34
## 22                          Bulgaria   30
## 23                      Burkina Faso   26
## 24                           Burundi   26
## 25                          Cambodia   26
## 26                          Cameroon   31
## 27                            Canada   19
## 28                        Cape Verde   14
## 29          Central African Republic   26
## 30                              Chad   27
## 31                             China   27
## 32                           Comoros   25
## 33                        Costa Rica   26
## 34                     Cote d'Ivoire   32
## 35                           Croatia   18
## 36                              Cuba   28
## 37                            Cyprus   30
## 38                    Czech Republic   31
## 39  Democratic Republic of the Congo   23
## 40                           Denmark   26
## 41                          Djibouti   26
## 42                          Dominica   18
## 43                Dominican Republic   29
## 44                        East Timor   27
## 45                             Egypt   25
## 46                       El Salvador   24
## 47                 Equatorial Guinea   33
## 48                           Eritrea   19
## 49                           Estonia   23
## 50                          Ethiopia   26
## 51    Federated States of Micronesia   20
## 52                              Fiji   28
## 53                           Finland   23
## 54                            France   26
## 55                             Gabon   22
## 56                           Georgia   23
## 57                           Germany   26
## 58                             Ghana   38
## 59                            Greece   32
## 60                         Greenland   26
## 61                           Grenada   36
## 62                         Guatemala   28
## 63                            Guinea   33
## 64                     Guinea-Bissau   29
## 65                             Haiti   36
## 66                          Honduras   33
## 67                           Hungary   25
## 68                           Iceland   23
## 69                             India   25
## 70                         Indonesia   28
## 71                              Iran   29
## 72                              Iraq   29
## 73                           Ireland   30
## 74                            Israel   23
## 75                             Italy   30
## 76                           Jamaica   27
## 77                             Japan   21
## 78                            Jordan   31
## 79                        Kazakhstan   24
## 80                             Kenya   28
## 81                          Kiribati   25
## 82                            Kosovo   35
## 83                            Kuwait   26
## 84                        Kyrgyzstan   29
## 85                              Laos   34
## 86                            Latvia   27
## 87                           Lebanon   31
## 88                           Lesotho   23
## 89                           Liberia   27
## 90                             Libya   32
## 91                     Liechtenstein   24
## 92                         Lithuania   25
## 93                        Luxembourg   28
## 94                         Macedonia   32
## 95                        Madagascar   24
## 96                            Malawi   24
## 97                          Malaysia   27
## 98                          Maldives   20
## 99                              Mali   33
## 100                            Malta   32
## 101                 Marshall Islands   26
## 102                       Mauritania   29
## 103                       Mauritius    38
## 104                           Mexico   23
## 105                         Moldova    24
## 106                           Monaco   13
## 107                         Mongolia   15
## 108                       Montenegro   28
## 109                          Morocco   21
## 110                       Mozambique   25
## 111                          Myanmar   33
## 112                          Namibia   21
## 113                            Nauru   25
## 114                            Nepal   28
## 115                      Netherlands   24
## 116                      New Zealand   26
## 117                        Nicaragua   20
## 118                            Niger   35
## 119                          Nigeria   23
## 120                      North Korea   24
## 121                           Norway   30
## 122                             Oman   31
## 123                         Pakistan   27
## 124                            Palau   25
## 125                           Panama   32
## 126                 Papua New Guinea   20
## 127                      Philippines   24
## 128                           Poland   28
## 129                         Portugal   25
## 130                            Qatar   32
## 131            Republic of the Congo   31
## 132                          Romania   34
## 133                           Russia   32
## 134                           Rwanda   30
## 135           Saint Kitts and Nevis    32
## 136                      Saint Lucia   29
## 137 Saint Vincent and the Grenadines   22
## 138                           Samoa    35
## 139                       San Marino   40
## 140            Sao Tome and Principe   24
## 141                     Saudi Arabia   23
## 142                          Senegal   30
## 143                           Serbia   29
## 144                      Seychelles    30
## 145                     Sierra Leone   23
## 146                        Singapore   23
## 147                         Slovakia   30
## 148                         Slovenia   27
## 149                  Solomon Islands   31
## 150                          Somalia   36
## 151                     South Africa   23
## 152                      South Korea   36
## 153                      South Sudan   25
## 154                            Spain   26
## 155                        Sri Lanka   31
## 156                            Sudan   34
## 157                        Swaziland   25
## 158                           Sweden   29
## 159                      Switzerland   28
## 160                            Syria   19
## 161                           Taiwan   31
## 162                       Tajikistan   25
## 163                         Tanzania   27
## 164                         Thailand   25
## 165                      The Bahamas   23
## 166                       The Gambia   19
## 167                             Togo   30
## 168                            Tonga   37
## 169              Trinidad and Tobago   19
## 170                         Tunisia    31
## 171                           Turkey   25
## 172                     Turkmenistan   31
## 173                           Tuvalu   22
## 174                           Uganda   22
## 175                          Ukraine   33
## 176             United Arab Emirates   25
## 177                   United Kingdom   23
## 178         United States of America   38
## 179                       Uzbekistan   27
## 180                          Vanuatu   35
## 181                     Vatican City   23
## 182                          Vietnam   19
## 183                            Yemen   22
## 184                           Zambia   31
## 185                         Zimbabwe   21

region <- ggplot(large_sales, aes(x=Region)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
type <- ggplot(large_sales, aes(x=Item.Type)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
channel <- ggplot(large_sales, aes(x=Sales.Channel)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
priority <- ggplot(large_sales, aes(x=Order.Priority)) + geom_bar(color="darkgreen", fill="lightgreen", alpha=0.2)+ theme_classic() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))


plot_grid(region, type, channel, priority, labels = c( 'Region','Type', 'Channel', 'Priority'))

corrplot(cor(dplyr::select_if(large_sales, is.numeric)),
         method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE, number.cex = 0.75, tl.cex = 0.5, col=colorRampPalette(c("yellow","white","darkgreen"))(200))

ML Algorithms

Multiple Linear Regression

set.seed(123)
train_ind <- sample(seq_len(nrow(small_sales)), size = floor(0.85 * nrow(small_sales)))

s_train <- small_sales[train_ind, ]
s_test <- small_sales[-train_ind, ]
train_ind <- sample(seq_len(nrow(large_sales)), size = floor(0.85 * nrow(large_sales)))

l_train <- large_sales[train_ind, ]
l_test <- large_sales[-train_ind, ]

Small Data

small_lm <- lm(Total.Profit ~ Region + Item.Type + Sales.Channel + Order.Priority, data = s_train)

#summary(small_lm)

s_test$Predicted <- predict(small_lm, newdata = s_test)

# Model performance
# (a) Prediction error, RMSE
RMSE(s_test$Predicted, s_test$Total.Profit)

## [1] 307989.5

# (b) R-square
R2(s_test$Predicted, s_test$Total.Profit)

## [1] 0.1623618

Large Data

large_lm <- lm(Total.Profit ~ Region + Item.Type + Sales.Channel + Order.Priority, data = l_train)

#summary(large_lm)

l_test$Predicted <- predict(large_lm, newdata = l_test)

# Model performance
# (a) Prediction error, RMSE
RMSE(l_test$Predicted, l_test$Total.Profit)

## [1] 262668.9

# (b) R-square
R2(l_test$Predicted, l_test$Total.Profit)

## [1] 0.4640496

Multinomial Logistic Regression

set.seed(123)
train_ind <- sample(seq_len(nrow(small_sales)), size = floor(0.85 * nrow(small_sales)))

s_train <- small_sales[train_ind, ]
s_test <- small_sales[-train_ind, ]
train_ind <- sample(seq_len(nrow(large_sales)), size = floor(0.85 * nrow(large_sales)))

l_train <- large_sales[train_ind, ]
l_test <- large_sales[-train_ind, ]

Small Data

small_mlm <- multinom(Order.Priority ~ Region + Item.Type +
        Units.Sold + Unit.Price +
          Unit.Cost + Total.Cost +
          Total.Profit + Total.Revenue,data=s_train)

## # weights:  100 (72 variable)
## initial  value 117.835021 
## iter  10 value 111.239071
## iter  20 value 80.808460
## iter  30 value 75.873256
## iter  40 value 75.490182
## iter  50 value 75.465676
## iter  60 value 75.460549
## iter  70 value 75.460331
## final  value 75.460323 
## converged

#summary(small_mlm)

# Prediction
s_test$Predicted <- predict(small_mlm, newdata = s_test, "class")

tab <- table(s_test$Order.Priority, s_test$Predicted)
tab

##    
##     C H L M
##   C 0 3 0 0
##   H 2 3 1 0
##   L 0 0 3 0
##   M 0 2 1 0

# Accuracy
round((sum(diag(tab))/sum(tab))*100,2)

## [1] 40

Large Data

large_mlm <- multinom(Order.Priority ~ Region + Item.Type + 
        Units.Sold + Unit.Price +
          Unit.Cost + Total.Cost +
          Total.Profit + Total.Revenue,data=l_train)

## # weights:  100 (72 variable)
## initial  value 5891.751035 
## iter  10 value 5886.284972
## iter  20 value 5866.867730
## iter  30 value 5860.619005
## iter  40 value 5858.992679
## final  value 5858.942520 
## converged

#summary(large_mlm)

# Prediction
l_test$Predicted <- predict(large_mlm, newdata = l_test, "class")

tab2 <- table(l_test$Order.Priority, l_test$Predicted)
tab2

##    
##      C  H  L  M
##   C 28 29 37 72
##   H 30 23 57 84
##   L 24 31 53 76
##   M 32 37 45 92

# Accuracy
round((sum(diag(tab2))/sum(tab2))*100,2)

## [1] 26.13

Report

This machine learning algorithm analysis works with the Sales dataset. This dataset is randomly generated and can be found at https://excelbianalytics.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. This analysis requires two opposing sizes of data and for that I have chosen the 100 records and 5000 records set respectively.

If we take a look at the dataset we can see that it contains 14 columns and 100 and 5000 rows respectively. The data contains information on sales transactions including associate dates, items, costs, and location. Of the fourteen columns we have two date columns, five categorical columns, six numeric columns, and one identification column.

To start off, the analysis data columns were converted to better suited data types, and checked for any null values. Each dataset did not include any missing values. In order to see the spread of the data visual and statistical summaries were performed. If we look at the box plots of the numeric columns we can see that each column is right skewed across both datasets. The categorical data is also similar across both datasets.

These datasets are very practical for machine learning analysis as they are similar to real everyday data that would require analysis. Businesses can use such data for forecasting futures sales and predicting consumer behavior. For this analysis I have decided to implement two machine learning algorithms: Multiple Linear Regression and Multinomial Logistic Regression. Each implementation has everyday uses.

Multiple Linear Regression is a regression model that estimates the relationship between a continuous dependent variable and two or more independent variables. It is a supervised machine learning algorithm and aims to find the line of best fit between the dependent and independent variables. For this implementation we are aiming to model the relationship between the Total.Profit variable and the set of independent variables: Region, Item.Type, Sales.Channel, and Order.Priority. For each dataset, the data was split into training and testing subsets based on a 85:15 ratio. The model was trained on the training data and predictions were made using the testing data. The RMSE and R-squared values were used to evaluate the model. The model for the small data set had a RMSE of 307989.5 and a R-squared of 0.1623618. The model for the large data set had a RMSE of 262668.9 and a R-squared of 0.4640496. For this algorithm the dataset with the larger records did better with a lower RMSE and R-squared. Neither of the models gave a relatively high R-squared, but seeing that the larger dataset performed better, it would be reasonable to predict that an even larger dataset would yield better results.

Multinomial Logistic Regression is a machine learning algorithm used to predict the categorical values for a dependent variable. It is a type of logistic regression where the dependent variable has more than two categories. For this implementation we are aiming to predict the Order.Priority value based on the following columns: Region, Item.Type, Units.Sold, Unit.Price, Unit.Cost, Total.Cost, Total.Profit, and Total.Revenue. The data is split into training and testing data based on a 85:15 ratio. The model is trained on the training data and predictions were made on the testing data. The model with the smaller data had a 40% accuracy in its predictions. The model with the larger data had a 26.13% accuracy in its predictions. The models seem not to work well with the larger dataset. We can test this theory in the future by using an even larger dataset.

Each algorithm has its own pros and cons. MLM seems to work better with data, while MLR works better with smaller data. Both algorithms were chosen because of the practical solutions they provide for everyday problems. While simple, these models provide a rich analysis and insight for future decisions.

Data 622 Homework 1

Krutika Patel

2023-03-04

Data Structure and Content

Glimpse

Convert Column Datatype

NA

Small Dataset

Large Dataset

ML Algorithms

Multiple Linear Regression

Small Data

Large Data

Multinomial Logistic Regression

Small Data

Large Data

Report