library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: ggplot2
Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large).
Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets.
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
We will first load the data, which I have saved in a local folder, and explore it.
sales_small <- readr::read_csv("10000 Sales Records.csv")
## Parsed with column specification:
## cols(
## Region = col_character(),
## Country = col_character(),
## `Item Type` = col_character(),
## `Sales Channel` = col_character(),
## `Order Priority` = col_character(),
## `Order Date` = col_character(),
## `Order ID` = col_double(),
## `Ship Date` = col_character(),
## `Units Sold` = col_double(),
## `Unit Price` = col_double(),
## `Unit Cost` = col_double(),
## `Total Revenue` = col_double(),
## `Total Cost` = col_double(),
## `Total Profit` = col_double()
## )
sales_large <- readr::read_csv("100000 Sales Records.csv")
## Parsed with column specification:
## cols(
## Region = col_character(),
## Country = col_character(),
## `Item Type` = col_character(),
## `Sales Channel` = col_character(),
## `Order Priority` = col_character(),
## `Order Date` = col_character(),
## `Order ID` = col_double(),
## `Ship Date` = col_character(),
## `Units Sold` = col_double(),
## `Unit Price` = col_double(),
## `Unit Cost` = col_double(),
## `Total Revenue` = col_double(),
## `Total Cost` = col_double(),
## `Total Profit` = col_double()
## )
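As an aside, the 5-million-record file from the same site would strain read_csv() on a modest machine; data.table::fread() is usually faster and lighter on memory. A minimal sketch, assuming the data.table package is installed (the file name below is hypothetical):
library(data.table)
# fread() parses large CSVs in parallel and is typically several times
# faster than base read.csv(); file name is hypothetical.
sales_huge <- fread("5m Sales Records.csv")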
The data contain 14 variables: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit.
head(sales_small)
## # A tibble: 6 x 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sub-Saha~ Chad Office Sup~ Online L 1/27/2011
## 2 Europe Latvia Beverages Online C 12/28/2015
## 3 Middle E~ Pakistan Vegetables Offline C 1/13/2011
## 4 Sub-Saha~ Democrati~ Household Online C 9/11/2012
## 5 Europe Czech Rep~ Beverages Online C 10/27/2015
## 6 Sub-Saha~ South Afr~ Beverages Offline H 7/10/2012
## # ... with 8 more variables: Order ID <dbl>, Ship Date <chr>, Units Sold <dbl>,
## # Unit Price <dbl>, Unit Cost <dbl>, Total Revenue <dbl>, Total Cost <dbl>,
## # Total Profit <dbl>
From the structure of the data below, we can observe that many of the variables are of type character. We will need to transform some of them into factors to better analyze the data in R.
str(sales_small)
## spec_tbl_df [10,000 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Region : chr [1:10000] "Sub-Saharan Africa" "Europe" "Middle East and North Africa" "Sub-Saharan Africa" ...
## $ Country : chr [1:10000] "Chad" "Latvia" "Pakistan" "Democratic Republic of the Congo" ...
## $ Item Type : chr [1:10000] "Office Supplies" "Beverages" "Vegetables" "Household" ...
## $ Sales Channel : chr [1:10000] "Online" "Online" "Offline" "Online" ...
## $ Order Priority: chr [1:10000] "L" "C" "C" "C" ...
## $ Order Date : chr [1:10000] "1/27/2011" "12/28/2015" "1/13/2011" "9/11/2012" ...
## $ Order ID : num [1:10000] 2.92e+08 3.62e+08 1.42e+08 5.00e+08 1.27e+08 ...
## $ Ship Date : chr [1:10000] "2/12/2011" "1/23/2016" "2/1/2011" "10/6/2012" ...
## $ Units Sold : num [1:10000] 4484 1075 6515 7683 3491 ...
## $ Unit Price : num [1:10000] 651.2 47.5 154.1 668.3 47.5 ...
## $ Unit Cost : num [1:10000] 525 31.8 90.9 502.5 31.8 ...
## $ Total Revenue : num [1:10000] 2920026 51009 1003701 5134318 165648 ...
## $ Total Cost : num [1:10000] 2353921 34174 592409 3861015 110979 ...
## $ Total Profit : num [1:10000] 566105 16835 411292 1273304 54669 ...
## - attr(*, "spec")=
## .. cols(
## .. Region = col_character(),
## .. Country = col_character(),
## .. `Item Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Order Priority` = col_character(),
## .. `Order Date` = col_character(),
## .. `Order ID` = col_double(),
## .. `Ship Date` = col_character(),
## .. `Units Sold` = col_double(),
## .. `Unit Price` = col_double(),
## .. `Unit Cost` = col_double(),
## .. `Total Revenue` = col_double(),
## .. `Total Cost` = col_double(),
## .. `Total Profit` = col_double()
## .. )
These data contain 10,000 observations and 14 variables.
dim(sales_small)
## [1] 10000 14
Below, we can also observe that there are no missing values in either the small or the large data set.
plot_missing(sales_small)
plot_missing(sales_large)
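For a quick non-graphical check of the same thing, base R suffices:
# Count missing values per column; all zeros confirm the plots above.
colSums(is.na(sales_small))
colSums(is.na(sales_large))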
We will create a separate data set, so that the original data remain intact, and make all the necessary transformations there. First we’ll convert the character variables above to factors, and the two date columns to class Date.
sales_prepared <- sales_small
sales_prepared$Region <- as.factor(sales_prepared$Region)
sales_prepared$Country <- as.factor(sales_prepared$Country)
sales_prepared$`Item Type` <- as.factor(sales_prepared$`Item Type`)
sales_prepared$`Sales Channel` <- as.factor(sales_prepared$`Sales Channel`)
sales_prepared$`Order Priority` <- as.factor(sales_prepared$`Order Priority`)
sales_prepared$`Order Date` <- as.Date(sales_prepared$`Order Date`, "%m/%d/%Y")
sales_prepared$`Ship Date` <- as.Date(sales_prepared$`Ship Date`, "%m/%d/%Y")
sales_prepared2 <- sales_large
sales_prepared2$Region <- as.factor(sales_prepared2$Region)
sales_prepared2$Country <- as.factor(sales_prepared2$Country)
sales_prepared2$`Item Type` <- as.factor(sales_prepared2$`Item Type`)
sales_prepared2$`Sales Channel` <- as.factor(sales_prepared2$`Sales Channel`)
sales_prepared2$`Order Priority` <- as.factor(sales_prepared2$`Order Priority`)
sales_prepared2$`Order Date` <- as.Date(sales_prepared2$`Order Date`, "%m/%d/%Y")
sales_prepared2$`Ship Date` <- as.Date(sales_prepared2$`Ship Date`, "%m/%d/%Y")
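As an aside, these repeated conversions could be written more compactly. A minimal sketch, assuming dplyr >= 1.0 (for across()) is available:
library(dplyr)

# Convert the five categorical columns to factors and the two date
# columns to Date in a single pipeline, reusable for both data sets.
prepare_sales <- function(df) {
  df %>%
    mutate(
      across(c(Region, Country, `Item Type`,
               `Sales Channel`, `Order Priority`), as.factor),
      across(c(`Order Date`, `Ship Date`),
             ~ as.Date(.x, format = "%m/%d/%Y"))
    )
}

sales_prepared  <- prepare_sales(sales_small)
sales_prepared2 <- prepare_sales(sales_large)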
We have changed the structure of our variables as it is shown below.
str(sales_prepared)
## spec_tbl_df [10,000 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Region : Factor w/ 7 levels "Asia","Australia and Oceania",..: 7 4 5 7 4 7 1 1 7 3 ...
## $ Country : Factor w/ 185 levels "Afghanistan",..: 30 86 123 39 38 151 85 31 48 65 ...
## $ Item Type : Factor w/ 12 levels "Baby Food","Beverages",..: 9 2 12 7 2 2 12 1 8 9 ...
## $ Sales Channel : Factor w/ 2 levels "Offline","Online": 2 2 1 2 2 1 2 2 2 2 ...
## $ Order Priority: Factor w/ 4 levels "C","H","L","M": 3 1 1 1 1 2 3 1 3 1 ...
## $ Order Date : Date[1:10000], format: "2011-01-27" "2015-12-28" ...
## $ Order ID : num [1:10000] 2.92e+08 3.62e+08 1.42e+08 5.00e+08 1.27e+08 ...
## $ Ship Date : Date[1:10000], format: "2011-02-12" "2016-01-23" ...
## $ Units Sold : num [1:10000] 4484 1075 6515 7683 3491 ...
## $ Unit Price : num [1:10000] 651.2 47.5 154.1 668.3 47.5 ...
## $ Unit Cost : num [1:10000] 525 31.8 90.9 502.5 31.8 ...
## $ Total Revenue : num [1:10000] 2920026 51009 1003701 5134318 165648 ...
## $ Total Cost : num [1:10000] 2353921 34174 592409 3861015 110979 ...
## $ Total Profit : num [1:10000] 566105 16835 411292 1273304 54669 ...
## - attr(*, "spec")=
## .. cols(
## .. Region = col_character(),
## .. Country = col_character(),
## .. `Item Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Order Priority` = col_character(),
## .. `Order Date` = col_character(),
## .. `Order ID` = col_double(),
## .. `Ship Date` = col_character(),
## .. `Units Sold` = col_double(),
## .. `Unit Price` = col_double(),
## .. `Unit Cost` = col_double(),
## .. `Total Revenue` = col_double(),
## .. `Total Cost` = col_double(),
## .. `Total Profit` = col_double()
## .. )
We can also observe the summary statistics for our transformed data.
summary(sales_prepared)
## Region Country
## Asia :1469 Lithuania : 72
## Australia and Oceania : 797 United Kingdom: 72
## Central America and the Caribbean:1019 Moldova : 71
## Europe :2633 Croatia : 70
## Middle East and North Africa :1264 Seychelles : 70
## North America : 215 Botswana : 69
## Sub-Saharan Africa :2603 (Other) :9576
## Item Type Sales Channel Order Priority Order Date
## Personal Care : 888 Offline:4939 C:2555 Min. :2010-01-01
## Household : 875 Online :5061 H:2503 1st Qu.:2011-12-08
## Clothes : 872 L:2494 Median :2013-11-02
## Baby Food : 842 M:2448 Mean :2013-10-27
## Office Supplies: 837 3rd Qu.:2015-09-11
## Vegetables : 836 Max. :2017-07-28
## (Other) :4850
## Order ID Ship Date Units Sold Unit Price
## Min. :100089156 Min. :2010-01-05 Min. : 2 Min. : 9.33
## 1st Qu.:321806669 1st Qu.:2012-01-04 1st Qu.: 2531 1st Qu.:109.28
## Median :548566305 Median :2013-11-26 Median : 4962 Median :205.70
## Mean :549871874 Mean :2013-11-21 Mean : 5003 Mean :268.14
## 3rd Qu.:775998104 3rd Qu.:2015-10-08 3rd Qu.: 7472 3rd Qu.:437.20
## Max. :999934232 Max. :2017-09-10 Max. :10000 Max. :668.27
##
## Unit Cost Total Revenue Total Cost Total Profit
## Min. : 6.92 Min. : 168 Min. : 125 Min. : 43.4
## 1st Qu.: 56.67 1st Qu.: 288551 1st Qu.: 164786 1st Qu.: 98329.1
## Median :117.11 Median : 800051 Median : 481606 Median : 289099.0
## Mean :188.81 Mean :1333355 Mean : 938266 Mean : 395089.3
## 3rd Qu.:364.69 3rd Qu.:1819143 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :524.96 Max. :6680027 Max. :5241726 Max. :1738178.4
##
The plots below show that most of the sales come from the Europe and Sub-Saharan Africa regions, and that offline and online sales are roughly evenly split.
plot(sales_prepared$`Sales Channel`)
plot(sales_prepared$Region)
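Since ggplot2 is already loaded (as a caret dependency), a labelled version of the same comparison can be sketched as:
# Orders by region, split by sales channel; coord_flip() keeps the long
# region names readable.
ggplot(sales_prepared, aes(x = Region, fill = `Sales Channel`)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  labs(title = "Orders by region and sales channel", x = NULL, y = "Orders")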
I would like to predict the “Sales Channel” variable, i.e., whether a sale was made “Online” or “Offline”, based on the “Country” in which the sale took place. I will also include “Item Type” as a second predictor to help the model.
I have decided to use a Logistic Regression model, as the dependent variable in this case is categorical and binary, with two possible outcomes, “Offline” and “Online”. I will also compare the LR model with a KNN model, which is simple and easy to implement. KNN can be used for both classification and regression problems, and since we have labelled data suitable for supervised learning, it could produce interesting results and predictions.
The code below has been borrowed from the class examples in order to run a Logistic Regression and KNN model on our data.
We first take a look at the “Offline” and “Online” counts by region to detect any relationship. At first glance, the two sales channels appear evenly balanced within every region.
# Two-way table of factor variables
xtabs(~`Sales Channel` + Region, data = sales_prepared)
## Region
## Sales Channel Asia Australia and Oceania Central America and the Caribbean
## Offline 708 373 517
## Online 761 424 502
## Region
## Sales Channel Europe Middle East and North Africa North America
## Offline 1318 625 109
## Online 1315 639 106
## Region
## Sales Channel Sub-Saharan Africa
## Offline 1289
## Online 1314
Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(sales_prepared), replace = T, prob = c(0.8, 0.2))
train <- sales_prepared[ind==1,]
test <- sales_prepared[ind==2,]
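As an aside, caret’s createDataPartition() would give an exact, class-stratified 80/20 split instead of the random, approximately 80/20 one above. A sketch, using new names so the split above is left unchanged:
# Stratified split: preserves the Offline/Online proportions exactly.
set.seed(1234)
idx <- createDataPartition(sales_prepared$`Sales Channel`,
                           p = 0.8, list = FALSE)
train_strat <- sales_prepared[idx, ]
test_strat  <- sales_prepared[-idx, ]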
We run the logistic regression model below:
# Logistic regression model
mymodel <- glm(`Sales Channel` ~ Country + `Item Type`, data = train, family = 'binomial')
summary(mymodel)
##
## Call:
## glm(formula = `Sales Channel` ~ Country + `Item Type`, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6915 -1.1602 0.7409 1.1559 1.5972
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.084246 0.304140 -0.277 0.7818
## CountryAlbania 0.080129 0.434282 0.185 0.8536
## CountryAlgeria 0.129147 0.420600 0.307 0.7588
## CountryAndorra 0.174690 0.455553 0.383 0.7014
## CountryAngola -0.384319 0.438937 -0.876 0.3813
## CountryAntigua and Barbuda -0.220430 0.438400 -0.503 0.6151
## CountryArmenia 0.179499 0.434502 0.413 0.6795
## CountryAustralia 0.572109 0.434332 1.317 0.1878
## CountryAustria 0.050946 0.420683 0.121 0.9036
## CountryAzerbaijan 0.584544 0.420818 1.389 0.1648
## CountryBahrain -0.454549 0.421998 -1.077 0.2814
## CountryBangladesh 0.595475 0.414599 1.436 0.1509
## CountryBarbados 0.242312 0.431636 0.561 0.5745
## CountryBelarus 0.221259 0.431600 0.513 0.6082
## CountryBelgium -0.106862 0.445472 -0.240 0.8104
## CountryBelize 0.293237 0.448407 0.654 0.5131
## CountryBenin -0.206775 0.414364 -0.499 0.6178
## CountryBhutan 0.327517 0.426716 0.768 0.4428
## CountryBosnia and Herzegovina -0.485337 0.426441 -1.138 0.2551
## CountryBotswana 0.430343 0.402105 1.070 0.2845
## CountryBrunei 0.404662 0.422584 0.958 0.3383
## CountryBulgaria 0.089878 0.447623 0.201 0.8409
## CountryBurkina Faso 0.506404 0.429898 1.178 0.2388
## CountryBurundi -0.366183 0.407710 -0.898 0.3691
## CountryCambodia 0.718402 0.415766 1.728 0.0840 .
## CountryCameroon -0.333279 0.450438 -0.740 0.4594
## CountryCanada -0.207151 0.389917 -0.531 0.5952
## CountryCape Verde 0.324849 0.448728 0.724 0.4691
## CountryCentral African Republic -0.032155 0.441636 -0.073 0.9420
## CountryChad 0.461329 0.447495 1.031 0.3026
## CountryChina 0.122602 0.420779 0.291 0.7708
## CountryComoros 0.187092 0.440578 0.425 0.6711
## CountryCosta Rica 0.347829 0.438830 0.793 0.4280
## CountryCote d'Ivoire 0.369700 0.438622 0.843 0.3993
## CountryCroatia 0.059983 0.396882 0.151 0.8799
## CountryCuba 0.020201 0.425920 0.047 0.9622
## CountryCyprus 0.223972 0.437554 0.512 0.6087
## CountryCzech Republic -0.258961 0.430015 -0.602 0.5470
## CountryDemocratic Republic of the Congo 0.026567 0.402122 0.066 0.9473
## CountryDenmark 0.284055 0.404259 0.703 0.4823
## CountryDjibouti 0.379381 0.417740 0.908 0.3638
## CountryDominica -0.054182 0.416564 -0.130 0.8965
## CountryDominican Republic -0.025499 0.429149 -0.059 0.9526
## CountryEast Timor 1.059667 0.475582 2.228 0.0259 *
## CountryEgypt 0.029999 0.416076 0.072 0.9425
## CountryEl Salvador 0.167878 0.401753 0.418 0.6760
## CountryEquatorial Guinea 0.140472 0.437280 0.321 0.7480
## CountryEritrea 0.732428 0.463487 1.580 0.1140
## CountryEstonia -0.242958 0.429932 -0.565 0.5720
## CountryEthiopia 0.124850 0.415687 0.300 0.7639
## CountryFederated States of Micronesia 0.045999 0.425957 0.108 0.9140
## CountryFiji 0.751068 0.426241 1.762 0.0781 .
## CountryFinland 0.323784 0.421532 0.768 0.4424
## CountryFrance 0.035080 0.420941 0.083 0.9336
## CountryGabon -0.058534 0.432771 -0.135 0.8924
## CountryGeorgia 0.130716 0.420496 0.311 0.7559
## CountryGermany 0.316967 0.421435 0.752 0.4520
## CountryGhana 0.446371 0.426339 1.047 0.2951
## CountryGreece 0.248792 0.431598 0.576 0.5643
## CountryGreenland -0.168480 0.458253 -0.368 0.7131
## CountryGrenada -0.218457 0.408539 -0.535 0.5928
## CountryGuatemala 0.258926 0.409673 0.632 0.5274
## CountryGuinea 0.471191 0.431778 1.091 0.2751
## CountryGuinea-Bissau -0.082656 0.414910 -0.199 0.8421
## CountryHaiti -0.314074 0.433932 -0.724 0.4692
## CountryHonduras 0.057726 0.407433 0.142 0.8873
## CountryHungary -0.059738 0.421389 -0.142 0.8873
## CountryIceland 0.096666 0.413702 0.234 0.8152
## CountryIndia 0.728228 0.414469 1.757 0.0789 .
## CountryIndonesia -0.277066 0.429874 -0.645 0.5192
## CountryIran 0.384491 0.399658 0.962 0.3360
## CountryIraq 0.455195 0.426127 1.068 0.2854
## CountryIreland 0.544319 0.436287 1.248 0.2122
## CountryIsrael 0.037578 0.411631 0.091 0.9273
## CountryItaly 0.063933 0.447414 0.143 0.8864
## CountryJamaica -0.148747 0.442908 -0.336 0.7370
## CountryJapan 0.173508 0.418183 0.415 0.6782
## CountryJordan 0.683843 0.429160 1.593 0.1111
## CountryKazakhstan 0.820658 0.459458 1.786 0.0741 .
## CountryKenya 0.006786 0.409795 0.017 0.9868
## CountryKiribati -0.155606 0.398540 -0.390 0.6962
## CountryKosovo 0.069345 0.405588 0.171 0.8642
## CountryKuwait 0.001477 0.413771 0.004 0.9972
## CountryKyrgyzstan -0.173719 0.415843 -0.418 0.6761
## CountryLaos -0.023447 0.434714 -0.054 0.9570
## CountryLatvia 0.365470 0.452983 0.807 0.4198
## CountryLebanon 0.381957 0.430069 0.888 0.3745
## CountryLesotho 0.117259 0.411435 0.285 0.7756
## CountryLiberia -0.574866 0.448872 -1.281 0.2003
## CountryLibya 0.033509 0.426264 0.079 0.9373
## CountryLiechtenstein 0.047131 0.407579 0.116 0.9079
## CountryLithuania 0.230487 0.392228 0.588 0.5568
## CountryLuxembourg 0.263031 0.429109 0.613 0.5399
## CountryMacedonia -0.414528 0.456429 -0.908 0.3638
## CountryMadagascar -0.284271 0.422413 -0.673 0.5010
## CountryMalawi 0.230024 0.392408 0.586 0.5577
## CountryMalaysia -0.205752 0.432128 -0.476 0.6340
## CountryMaldives 0.240720 0.426174 0.565 0.5722
## CountryMali 0.390099 0.471056 0.828 0.4076
## CountryMalta 0.575517 0.450398 1.278 0.2013
## CountryMarshall Islands -0.141202 0.450143 -0.314 0.7538
## CountryMauritania 0.323197 0.426782 0.757 0.4489
## CountryMauritius 0.152995 0.413805 0.370 0.7116
## CountryMexico -0.052045 0.441578 -0.118 0.9062
## CountryMoldova -0.226280 0.396340 -0.571 0.5681
## CountryMonaco 0.446451 0.493736 0.904 0.3659
## CountryMongolia -0.010527 0.423529 -0.025 0.9802
## CountryMontenegro 0.196883 0.393598 0.500 0.6169
## CountryMorocco 0.317286 0.395735 0.802 0.4227
## CountryMozambique -0.233389 0.418945 -0.557 0.5775
## CountryMyanmar 0.019857 0.444523 0.045 0.9644
## CountryNamibia 0.269894 0.423680 0.637 0.5241
## CountryNauru 0.133485 0.431119 0.310 0.7568
## CountryNepal 0.392972 0.443393 0.886 0.3755
## CountryNetherlands 0.040905 0.437596 0.093 0.9255
## CountryNew Zealand 0.331040 0.410398 0.807 0.4199
## CountryNicaragua 0.437556 0.467507 0.936 0.3493
## CountryNiger -0.129800 0.417712 -0.311 0.7560
## CountryNigeria 0.018611 0.444288 0.042 0.9666
## CountryNorth Korea -0.123063 0.417743 -0.295 0.7683
## CountryNorway -0.108176 0.445508 -0.243 0.8081
## CountryOman 0.428251 0.422630 1.013 0.3109
## CountryPakistan 0.012131 0.437567 0.028 0.9779
## CountryPalau 0.423564 0.422609 1.002 0.3162
## CountryPanama 0.087359 0.428629 0.204 0.8385
## CountryPapua New Guinea 0.402853 0.450192 0.895 0.3709
## CountryPhilippines -0.147036 0.422679 -0.348 0.7279
## CountryPoland -0.227323 0.418935 -0.543 0.5874
## CountryPortugal 0.048500 0.411753 0.118 0.9062
## CountryQatar -0.114908 0.403596 -0.285 0.7759
## CountryRepublic of the Congo 0.259003 0.414234 0.625 0.5318
## CountryRomania 0.384472 0.430040 0.894 0.3713
## CountryRussia -0.095968 0.419570 -0.229 0.8191
## CountryRwanda 0.209427 0.407586 0.514 0.6074
## CountrySaint Kitts and Nevis -0.252535 0.407165 -0.620 0.5351
## CountrySaint Lucia 0.513987 0.464357 1.107 0.2683
## CountrySaint Vincent and the Grenadines 0.523002 0.445031 1.175 0.2399
## CountrySamoa 0.451333 0.410041 1.101 0.2710
## CountrySan Marino -0.642178 0.455789 -1.409 0.1589
## CountrySao Tome and Principe 0.441181 0.420887 1.048 0.2945
## CountrySaudi Arabia 0.403109 0.428044 0.942 0.3463
## CountrySenegal -0.119740 0.413182 -0.290 0.7720
## CountrySerbia -0.241054 0.424208 -0.568 0.5699
## CountrySeychelles 0.144282 0.395196 0.365 0.7150
## CountrySierra Leone 0.275383 0.435014 0.633 0.5267
## CountrySingapore 0.076476 0.440774 0.174 0.8623
## CountrySlovakia -0.254316 0.445008 -0.571 0.5677
## CountrySlovenia 0.101756 0.405870 0.251 0.8020
## CountrySolomon Islands 0.124176 0.437371 0.284 0.7765
## CountrySomalia 0.260288 0.424030 0.614 0.5393
## CountrySouth Africa 0.174715 0.428429 0.408 0.6834
## CountrySouth Korea 0.440281 0.415871 1.059 0.2897
## CountrySouth Sudan 0.234051 0.431689 0.542 0.5877
## CountrySpain 0.665990 0.430795 1.546 0.1221
## CountrySri Lanka 0.084400 0.418468 0.202 0.8402
## CountrySudan -0.105248 0.419239 -0.251 0.8018
## CountrySwaziland 0.159101 0.405604 0.392 0.6949
## CountrySweden -0.091905 0.453094 -0.203 0.8393
## CountrySwitzerland 0.574127 0.421118 1.363 0.1728
## CountrySyria 0.186420 0.455469 0.409 0.6823
## CountryTaiwan 0.011200 0.406048 0.028 0.9780
## CountryTajikistan 1.231747 0.525996 2.342 0.0192 *
## CountryTanzania 0.175299 0.423204 0.414 0.6787
## CountryThailand 0.407242 0.427935 0.952 0.3413
## CountryThe Bahamas -0.326681 0.434527 -0.752 0.4522
## CountryThe Gambia 0.171782 0.455625 0.377 0.7062
## CountryTogo -0.088768 0.414679 -0.214 0.8305
## CountryTonga 0.133220 0.451647 0.295 0.7680
## CountryTrinidad and Tobago -0.100891 0.424583 -0.238 0.8122
## CountryTunisia -0.287003 0.428000 -0.671 0.5025
## CountryTurkey -0.722408 0.464132 -1.556 0.1196
## CountryTurkmenistan -0.317598 0.421239 -0.754 0.4509
## CountryTuvalu 0.099008 0.405541 0.244 0.8071
## CountryUganda 0.217221 0.426129 0.510 0.6102
## CountryUkraine 0.074004 0.423130 0.175 0.8612
## CountryUnited Arab Emirates -0.509348 0.426431 -1.194 0.2323
## CountryUnited Kingdom 0.328439 0.388523 0.845 0.3979
## CountryUnited States of America 0.347281 0.410467 0.846 0.3975
## CountryUzbekistan 0.367002 0.430103 0.853 0.3935
## CountryVanuatu 0.227412 0.437481 0.520 0.6032
## CountryVatican City 0.250156 0.418610 0.598 0.5501
## CountryVietnam 0.161808 0.401676 0.403 0.6871
## CountryYemen 0.243398 0.451606 0.539 0.5899
## CountryZambia 0.523077 0.411595 1.271 0.2038
## CountryZimbabwe 0.728363 0.414733 1.756 0.0791 .
## `Item Type`Beverages 0.009793 0.113664 0.086 0.9313
## `Item Type`Cereal 0.071865 0.111969 0.642 0.5210
## `Item Type`Clothes -0.132251 0.111096 -1.190 0.2339
## `Item Type`Cosmetics -0.085303 0.111767 -0.763 0.4453
## `Item Type`Fruits 0.049436 0.113402 0.436 0.6629
## `Item Type`Household -0.074539 0.110025 -0.677 0.4981
## `Item Type`Meat -0.104381 0.112417 -0.929 0.3531
## `Item Type`Office Supplies -0.030833 0.110871 -0.278 0.7809
## `Item Type`Personal Care -0.011146 0.109706 -0.102 0.9191
## `Item Type`Snacks -0.081662 0.112331 -0.727 0.4672
## `Item Type`Vegetables -0.141253 0.111490 -1.267 0.2052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11110 on 8013 degrees of freedom
## Residual deviance: 10923 on 7818 degrees of freedom
## AIC: 11315
##
## Number of Fisher Scoring iterations: 4
# Prediction
p1 <- predict(mymodel, train, type = 'response')
head(p1)
## 1 2 3 4 5 6
## 0.5857078 0.5722450 0.4468595 0.4669937 0.5250446 0.4380831
The misclassification error on the training set is about 44% (an accuracy of roughly 56%), which is not great for a binary classifier.
# Misclassification error - train data
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = train$`Sales Channel`)
tab1
## Actual
## Predicted Offline Online
## 0 2217 1728
## 1 1784 2285
1 - sum(diag(tab1))/sum(tab1)
## [1] 0.4382331
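Since caret is loaded, confusionMatrix() reports the same table plus accuracy, sensitivity, and specificity directly; the only wrinkle is mapping the 0/1 predictions back to the factor labels. A sketch:
# glm codes the second factor level ("Online") as 1, so p1 > 0.5 means "Online".
pred1_class <- factor(ifelse(p1 > 0.5, "Online", "Offline"),
                      levels = levels(train$`Sales Channel`))
confusionMatrix(pred1_class, train$`Sales Channel`, positive = "Offline")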
On the test set, the misclassification error actually worsens to about 51%, meaning the model does no better than a coin flip on unseen data.
# Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$`Sales Channel`)
tab2
## Actual
## Predicted Offline Online
## 0 458 533
## 1 480 515
1 - sum(diag(tab2))/sum(tab2)
## [1] 0.5100705
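A threshold-free view of the same result is the area under the ROC curve for the test-set probabilities. A sketch, assuming the pROC package (not loaded above) is installed:
library(pROC)
# AUC for the logistic model on the test set; a value near 0.5 indicates
# essentially no discriminating power.
roc_lr <- roc(response = test$`Sales Channel`, predictor = p2)
auc(roc_lr)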
Additionally, the likelihood-ratio goodness-of-fit test below yields a p-value of 0.66: the drop in deviance (11110 - 10923 = 187) on 8013 - 7818 = 195 degrees of freedom is not significant, suggesting that “Sales Channel” is independent of, or has no detectable relationship with, “Country” and “Item Type”.
# Goodness-of-fit test
with(mymodel, pchisq(null.deviance - deviance, df.null-df.residual, lower.tail = F))
## [1] 0.6597994
Now we will use the KNN model and see if our predictions can be improved.
# K-NN model
trControl <- trainControl(method = "repeatedcv", #repeated cross-validation
number = 3, # number of resampling iterations
classProbs = TRUE,
summaryFunction = twoClassSummary) # classProbs needed for ROC
set.seed(1234)
mymodel_knn <- train(`Sales Channel` ~ Country + `Item Type`,
                     data = train,
                     method = "knn",
                     metric = "ROC",
                     trControl = trControl)
As we can see below, the ROC AUC stays close to 0.5 for all values of k, with k = 9 selected. This suggests that this model has not improved our predictions much beyond the logistic regression model.
# Model performance
mymodel_knn
## k-Nearest Neighbors
##
## 8014 samples
## 2 predictor
## 2 classes: 'Offline', 'Online'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 1 times)
## Summary of sample sizes: 5342, 5343, 5343
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.4949696 0.5003738 0.4829309
## 7 0.4952805 0.4826272 0.5028699
## 9 0.4976715 0.4718764 0.5242975
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
In the following graph we can observe how the ROC AUC improves slightly as the number of neighbors increases from 5 to 9.
plot(mymodel_knn)
p3 <- predict(mymodel_knn, newdata = test )
The test-set accuracy is essentially the same as the logistic regression model’s:
confusionMatrix(p3, test$`Sales Channel`, positive = 'Offline' )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Offline Online
## Offline 449 514
## Online 489 534
##
## Accuracy : 0.495
## 95% CI : (0.4727, 0.5172)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : 0.9984
##
## Kappa : -0.0118
##
## Mcnemar's Test P-Value : 0.4486
##
## Sensitivity : 0.4787
## Specificity : 0.5095
## Pos Pred Value : 0.4663
## Neg Pred Value : 0.5220
## Prevalence : 0.4723
## Detection Rate : 0.2261
## Detection Prevalence : 0.4849
## Balanced Accuracy : 0.4941
##
## 'Positive' Class : Offline
##
Now we will use the larger data set to find out whether adding data to our model improves its predictive accuracy. I will keep using the logistic regression model, since the two models performed comparably on the test set and logistic regression is much faster to fit on 100,000 rows.
Our larger data set contains 100,000 observations and the same 14 variables as our previous smaller data set.
dim(sales_large)
## [1] 100000 14
We partition the data as before, using the “sales_prepared2” data set that we transformed earlier.
# Partition data - train (80%) & test (20%)
set.seed(12345)
ind <- sample(2, nrow(sales_prepared2), replace = T, prob = c(0.8, 0.2))
train2 <- sales_prepared2[ind==1,]
test2 <- sales_prepared2[ind==2,]
We run the logistic regression model below:
# Logistic regression model
mymodel2 <- glm(`Sales Channel` ~ Country + `Item Type`, data = train2, family = 'binomial')
# Prediction
p4 <- predict(mymodel2, train2, type = 'response')
head(p4)
## 1 2 3 4 5 6
## 0.5041427 0.4736320 0.5033300 0.4812581 0.5111808 0.4865881
The accuracy of our model does not seem to have improved with more data; the training misclassification error is about 48%.
# Misclassification error - train data
pred4 <- ifelse(p4>0.5, 1, 0)
tab4 <- table(Predicted = pred4, Actual = train2$`Sales Channel`)
tab4
## Actual
## Predicted Offline Online
## 0 19549 18076
## 1 20415 21958
1 - sum(diag(tab4))/sum(tab4)
## [1] 0.4811495
The test-set misclassification error comes in just under 50%, slightly better than the roughly 51% we saw with the smaller data set. However, we did not see a big change in accuracy when adding more data to our model.
# Misclassification error - test data
p5 <- predict(mymodel2, test2, type = 'response')
pred5 <- ifelse(p5>0.5, 1, 0)
tab5 <- table(Predicted = pred5, Actual = test2$`Sales Channel`)
tab5
## Actual
## Predicted Offline Online
## 0 4761 4735
## 1 5221 5285
1 - sum(diag(tab5))/sum(tab5)
## [1] 0.4977502
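Putting the two test-set errors side by side makes the comparison explicit:
# Test misclassification error: small vs. large data set.
data.frame(dataset    = c("10,000 records", "100,000 records"),
           test_error = c(1 - sum(diag(tab2)) / sum(tab2),
                          1 - sum(diag(tab5)) / sum(tab5)))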
After executing different models on different sizes of data sets, I have not been able to determine that a bigger data set improves the efficacy of a model. Given that the goodness-of-fit test already indicated no relationship between the predictors and “Sales Channel”, the near-50% accuracy is unsurprising: no amount of additional data helps when the signal is absent. Perhaps the data I started with were already large enough that adding more records made no difference. Additionally, the models have room for improvement if we consider some of the other predictor variables in the data. Predicting “Units Sold” from the other variables could also be an interesting problem to solve.