library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## Loading required package: ggplot2
Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large).
Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they can be applied in the suggested environment of the datasets.
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
We will first load the data, which I have saved in a local folder, and explore it.
sales_small <- readr::read_csv("10000 Sales Records.csv")
## Parsed with column specification:
## cols(
## Region = col_character(),
## Country = col_character(),
## `Item Type` = col_character(),
## `Sales Channel` = col_character(),
## `Order Priority` = col_character(),
## `Order Date` = col_character(),
## `Order ID` = col_double(),
## `Ship Date` = col_character(),
## `Units Sold` = col_double(),
## `Unit Price` = col_double(),
## `Unit Cost` = col_double(),
## `Total Revenue` = col_double(),
## `Total Cost` = col_double(),
## `Total Profit` = col_double()
## )
sales_large <- readr::read_csv("100000 Sales Records.csv")
## Parsed with column specification:
## cols(
## Region = col_character(),
## Country = col_character(),
## `Item Type` = col_character(),
## `Sales Channel` = col_character(),
## `Order Priority` = col_character(),
## `Order Date` = col_character(),
## `Order ID` = col_double(),
## `Ship Date` = col_character(),
## `Units Sold` = col_double(),
## `Unit Price` = col_double(),
## `Unit Cost` = col_double(),
## `Total Revenue` = col_double(),
## `Total Cost` = col_double(),
## `Total Profit` = col_double()
## )
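As an aside, the 5-million-record file from the same site would strain read_csv() on a modest machine; data.table::fread() is usually faster and lighter on memory. A minimal sketch, assuming the data.table package is installed (the file name below is hypothetical):
library(data.table)
# fread() parses large CSVs in parallel and is typically several times
# faster than base read.csv(); file name is hypothetical.
sales_huge <- fread("5m Sales Records.csv")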
The data contain 14 variables: Region, Country, Item Type, Sales Channel, Order Priority, Order Date, Order ID, Ship Date, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit.
head(sales_small)
## # A tibble: 6 x 14
## Region Country `Item Type` `Sales Channel` `Order Priority` `Order Date`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Sub-Saha~ Chad Office Sup~ Online L 1/27/2011
## 2 Europe Latvia Beverages Online C 12/28/2015
## 3 Middle E~ Pakistan Vegetables Offline C 1/13/2011
## 4 Sub-Saha~ Democrati~ Household Online C 9/11/2012
## 5 Europe Czech Rep~ Beverages Online C 10/27/2015
## 6 Sub-Saha~ South Afr~ Beverages Offline H 7/10/2012
## # ... with 8 more variables: Order ID <dbl>, Ship Date <chr>, Units Sold <dbl>,
## # Unit Price <dbl>, Unit Cost <dbl>, Total Revenue <dbl>, Total Cost <dbl>,
## # Total Profit <dbl>
From the structure of the data below, we can observe that many of the variables are of type character. We will need to transform some of them into factors to better analyze the data in R.
str(sales_small)
## spec_tbl_df [10,000 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Region : chr [1:10000] "Sub-Saharan Africa" "Europe" "Middle East and North Africa" "Sub-Saharan Africa" ...
## $ Country : chr [1:10000] "Chad" "Latvia" "Pakistan" "Democratic Republic of the Congo" ...
## $ Item Type : chr [1:10000] "Office Supplies" "Beverages" "Vegetables" "Household" ...
## $ Sales Channel : chr [1:10000] "Online" "Online" "Offline" "Online" ...
## $ Order Priority: chr [1:10000] "L" "C" "C" "C" ...
## $ Order Date : chr [1:10000] "1/27/2011" "12/28/2015" "1/13/2011" "9/11/2012" ...
## $ Order ID : num [1:10000] 2.92e+08 3.62e+08 1.42e+08 5.00e+08 1.27e+08 ...
## $ Ship Date : chr [1:10000] "2/12/2011" "1/23/2016" "2/1/2011" "10/6/2012" ...
## $ Units Sold : num [1:10000] 4484 1075 6515 7683 3491 ...
## $ Unit Price : num [1:10000] 651.2 47.5 154.1 668.3 47.5 ...
## $ Unit Cost : num [1:10000] 525 31.8 90.9 502.5 31.8 ...
## $ Total Revenue : num [1:10000] 2920026 51009 1003701 5134318 165648 ...
## $ Total Cost : num [1:10000] 2353921 34174 592409 3861015 110979 ...
## $ Total Profit : num [1:10000] 566105 16835 411292 1273304 54669 ...
## - attr(*, "spec")=
## .. cols(
## .. Region = col_character(),
## .. Country = col_character(),
## .. `Item Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Order Priority` = col_character(),
## .. `Order Date` = col_character(),
## .. `Order ID` = col_double(),
## .. `Ship Date` = col_character(),
## .. `Units Sold` = col_double(),
## .. `Unit Price` = col_double(),
## .. `Unit Cost` = col_double(),
## .. `Total Revenue` = col_double(),
## .. `Total Cost` = col_double(),
## .. `Total Profit` = col_double()
## .. )
These data contain 10,000 observations and 14 variables.
dim(sales_small)
## [1] 10000 14
Below, we can also observe that there are no missing values in either the small or the large data set.
plot_missing(sales_small)
plot_missing(sales_large)
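For a quick non-graphical check of the same thing, base R suffices:
# Count missing values per column; all zeros confirm the plots above.
colSums(is.na(sales_small))
colSums(is.na(sales_large))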
We will create a separate data set, so that the original data remain intact, and make all the necessary transformations there. First we’ll convert the character variables above to factors, and the two date columns to class Date.
sales_prepared <- sales_small
sales_prepared$Region <- as.factor(sales_prepared$Region)
sales_prepared$Country <- as.factor(sales_prepared$Country)
sales_prepared$`Item Type` <- as.factor(sales_prepared$`Item Type`)
sales_prepared$`Sales Channel` <- as.factor(sales_prepared$`Sales Channel`)
sales_prepared$`Order Priority` <- as.factor(sales_prepared$`Order Priority`)
sales_prepared$`Order Date` <- as.Date(sales_prepared$`Order Date`, "%m/%d/%Y")
sales_prepared$`Ship Date` <- as.Date(sales_prepared$`Ship Date`, "%m/%d/%Y")
sales_prepared2 <- sales_large
sales_prepared2$Region <- as.factor(sales_prepared2$Region)
sales_prepared2$Country <- as.factor(sales_prepared2$Country)
sales_prepared2$`Item Type` <- as.factor(sales_prepared2$`Item Type`)
sales_prepared2$`Sales Channel` <- as.factor(sales_prepared2$`Sales Channel`)
sales_prepared2$`Order Priority` <- as.factor(sales_prepared2$`Order Priority`)
sales_prepared2$`Order Date` <- as.Date(sales_prepared2$`Order Date`, "%m/%d/%Y")
sales_prepared2$`Ship Date` <- as.Date(sales_prepared2$`Ship Date`, "%m/%d/%Y")
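As an aside, these repeated conversions could be written more compactly. A minimal sketch, assuming dplyr >= 1.0 (for across()) is available:
library(dplyr)

# Convert the five categorical columns to factors and the two date
# columns to Date in a single pipeline, reusable for both data sets.
prepare_sales <- function(df) {
  df %>%
    mutate(
      across(c(Region, Country, `Item Type`,
               `Sales Channel`, `Order Priority`), as.factor),
      across(c(`Order Date`, `Ship Date`),
             ~ as.Date(.x, format = "%m/%d/%Y"))
    )
}

sales_prepared  <- prepare_sales(sales_small)
sales_prepared2 <- prepare_sales(sales_large)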
We have changed the structure of our variables as it is shown below.
str(sales_prepared)
## spec_tbl_df [10,000 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Region : Factor w/ 7 levels "Asia","Australia and Oceania",..: 7 4 5 7 4 7 1 1 7 3 ...
## $ Country : Factor w/ 185 levels "Afghanistan",..: 30 86 123 39 38 151 85 31 48 65 ...
## $ Item Type : Factor w/ 12 levels "Baby Food","Beverages",..: 9 2 12 7 2 2 12 1 8 9 ...
## $ Sales Channel : Factor w/ 2 levels "Offline","Online": 2 2 1 2 2 1 2 2 2 2 ...
## $ Order Priority: Factor w/ 4 levels "C","H","L","M": 3 1 1 1 1 2 3 1 3 1 ...
## $ Order Date : Date[1:10000], format: "2011-01-27" "2015-12-28" ...
## $ Order ID : num [1:10000] 2.92e+08 3.62e+08 1.42e+08 5.00e+08 1.27e+08 ...
## $ Ship Date : Date[1:10000], format: "2011-02-12" "2016-01-23" ...
## $ Units Sold : num [1:10000] 4484 1075 6515 7683 3491 ...
## $ Unit Price : num [1:10000] 651.2 47.5 154.1 668.3 47.5 ...
## $ Unit Cost : num [1:10000] 525 31.8 90.9 502.5 31.8 ...
## $ Total Revenue : num [1:10000] 2920026 51009 1003701 5134318 165648 ...
## $ Total Cost : num [1:10000] 2353921 34174 592409 3861015 110979 ...
## $ Total Profit : num [1:10000] 566105 16835 411292 1273304 54669 ...
## - attr(*, "spec")=
## .. cols(
## .. Region = col_character(),
## .. Country = col_character(),
## .. `Item Type` = col_character(),
## .. `Sales Channel` = col_character(),
## .. `Order Priority` = col_character(),
## .. `Order Date` = col_character(),
## .. `Order ID` = col_double(),
## .. `Ship Date` = col_character(),
## .. `Units Sold` = col_double(),
## .. `Unit Price` = col_double(),
## .. `Unit Cost` = col_double(),
## .. `Total Revenue` = col_double(),
## .. `Total Cost` = col_double(),
## .. `Total Profit` = col_double()
## .. )
We can also observe the summary statistics for our transformed data.
summary(sales_prepared)
## Region Country
## Asia :1469 Lithuania : 72
## Australia and Oceania : 797 United Kingdom: 72
## Central America and the Caribbean:1019 Moldova : 71
## Europe :2633 Croatia : 70
## Middle East and North Africa :1264 Seychelles : 70
## North America : 215 Botswana : 69
## Sub-Saharan Africa :2603 (Other) :9576
## Item Type Sales Channel Order Priority Order Date
## Personal Care : 888 Offline:4939 C:2555 Min. :2010-01-01
## Household : 875 Online :5061 H:2503 1st Qu.:2011-12-08
## Clothes : 872 L:2494 Median :2013-11-02
## Baby Food : 842 M:2448 Mean :2013-10-27
## Office Supplies: 837 3rd Qu.:2015-09-11
## Vegetables : 836 Max. :2017-07-28
## (Other) :4850
## Order ID Ship Date Units Sold Unit Price
## Min. :100089156 Min. :2010-01-05 Min. : 2 Min. : 9.33
## 1st Qu.:321806669 1st Qu.:2012-01-04 1st Qu.: 2531 1st Qu.:109.28
## Median :548566305 Median :2013-11-26 Median : 4962 Median :205.70
## Mean :549871874 Mean :2013-11-21 Mean : 5003 Mean :268.14
## 3rd Qu.:775998104 3rd Qu.:2015-10-08 3rd Qu.: 7472 3rd Qu.:437.20
## Max. :999934232 Max. :2017-09-10 Max. :10000 Max. :668.27
##
## Unit Cost Total Revenue Total Cost Total Profit
## Min. : 6.92 Min. : 168 Min. : 125 Min. : 43.4
## 1st Qu.: 56.67 1st Qu.: 288551 1st Qu.: 164786 1st Qu.: 98329.1
## Median :117.11 Median : 800051 Median : 481606 Median : 289099.0
## Mean :188.81 Mean :1333355 Mean : 938266 Mean : 395089.3
## 3rd Qu.:364.69 3rd Qu.:1819143 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :524.96 Max. :6680027 Max. :5241726 Max. :1738178.4
##
The plots below show that most of the sales come from the Europe and Sub-Saharan Africa regions, and that offline and online sales are roughly evenly split.
plot(sales_prepared$`Sales Channel`)
plot(sales_prepared$Region)
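Since ggplot2 is already loaded (as a caret dependency), a labelled version of the same comparison can be sketched as:
# Orders by region, split by sales channel; coord_flip() keeps the long
# region names readable.
ggplot(sales_prepared, aes(x = Region, fill = `Sales Channel`)) +
  geom_bar(position = "dodge") +
  coord_flip() +
  labs(title = "Orders by region and sales channel", x = NULL, y = "Orders")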
I would like to predict the “Sales Channel” variable, i.e., whether a sale was made “Online” or “Offline”, based on the “Country” in which the sale took place. I will also include “Item Type” as a second predictor to help the model.
I have decided to use a Logistic Regression model, as the dependent variable in this case is categorical and binary, with two possible outcomes, “Offline” and “Online”. I will also compare the LR model with a KNN model, which is simple and easy to implement. KNN can be used for both classification and regression problems, and since we have labelled data suitable for supervised learning, it could produce interesting results and predictions.
The code below has been borrowed from the class examples in order to run a Logistic Regression and KNN model on our data.
We first take a look at the “Offline” and “Online” counts by region to detect any relationship. At first glance, the two sales channels appear evenly balanced within every region.
# Two-way table of factor variables
xtabs(~`Sales Channel` + Region, data = sales_prepared)
## Region
## Sales Channel Asia Australia and Oceania Central America and the Caribbean
## Offline 708 373 517
## Online 761 424 502
## Region
## Sales Channel Europe Middle East and North Africa North America
## Offline 1318 625 109
## Online 1315 639 106
## Region
## Sales Channel Sub-Saharan Africa
## Offline 1289
## Online 1314
Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.
# Partition data - train (80%) & test (20%)
set.seed(1234)
ind <- sample(2, nrow(sales_prepared), replace = T, prob = c(0.8, 0.2))
train <- sales_prepared[ind==1,]
test <- sales_prepared[ind==2,]
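As an aside, caret’s createDataPartition() would give an exact, class-stratified 80/20 split instead of the random, approximately 80/20 one above. A sketch, using new names so the split above is left unchanged:
# Stratified split: preserves the Offline/Online proportions exactly.
set.seed(1234)
idx <- createDataPartition(sales_prepared$`Sales Channel`,
                           p = 0.8, list = FALSE)
train_strat <- sales_prepared[idx, ]
test_strat  <- sales_prepared[-idx, ]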
We run the logistic regression model below:
# Logistic regression model
mymodel <- glm(`Sales Channel` ~ Country + `Item Type`, data = train, family = 'binomial')
summary(mymodel)
##
## Call:
## glm(formula = `Sales Channel` ~ Country + `Item Type`, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6915 -1.1602 0.7409 1.1559 1.5972
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.084246 0.304140 -0.277 0.7818
## CountryAlbania 0.080129 0.434282 0.185 0.8536
## CountryAlgeria 0.129147 0.420600 0.307 0.7588
## CountryAndorra 0.174690 0.455553 0.383 0.7014
## CountryAngola -0.384319 0.438937 -0.876 0.3813
## CountryAntigua and Barbuda -0.220430 0.438400 -0.503 0.6151
## CountryArmenia 0.179499 0.434502 0.413 0.6795
## CountryAustralia 0.572109 0.434332 1.317 0.1878
## CountryAustria 0.050946 0.420683 0.121 0.9036
## CountryAzerbaijan 0.584544 0.420818 1.389 0.1648
## CountryBahrain -0.454549 0.421998 -1.077 0.2814
## CountryBangladesh 0.595475 0.414599 1.436 0.1509
## CountryBarbados 0.242312 0.431636 0.561 0.5745
## CountryBelarus 0.221259 0.431600 0.513 0.6082
## CountryBelgium -0.106862 0.445472 -0.240 0.8104
## CountryBelize 0.293237 0.448407 0.654 0.5131
## CountryBenin -0.206775 0.414364 -0.499 0.6178
## CountryBhutan 0.327517 0.426716 0.768 0.4428
## CountryBosnia and Herzegovina -0.485337 0.426441 -1.138 0.2551
## CountryBotswana 0.430343 0.402105 1.070 0.2845
## CountryBrunei 0.404662 0.422584 0.958 0.3383
## CountryBulgaria 0.089878 0.447623 0.201 0.8409
## CountryBurkina Faso 0.506404 0.429898 1.178 0.2388
## CountryBurundi -0.366183 0.407710 -0.898 0.3691
## CountryCambodia 0.718402 0.415766 1.728 0.0840 .
## CountryCameroon -0.333279 0.450438 -0.740 0.4594
## CountryCanada -0.207151 0.389917 -0.531 0.5952
## CountryCape Verde 0.324849 0.448728 0.724 0.4691
## CountryCentral African Republic -0.032155 0.441636 -0.073 0.9420
## CountryChad 0.461329 0.447495 1.031 0.3026
## CountryChina 0.122602 0.420779 0.291 0.7708
## CountryComoros 0.187092 0.440578 0.425 0.6711
## CountryCosta Rica 0.347829 0.438830 0.793 0.4280
## CountryCote d'Ivoire 0.369700 0.438622 0.843 0.3993
## CountryCroatia 0.059983 0.396882 0.151 0.8799
## CountryCuba 0.020201 0.425920 0.047 0.9622
## CountryCyprus 0.223972 0.437554 0.512 0.6087
## CountryCzech Republic -0.258961 0.430015 -0.602 0.5470
## CountryDemocratic Republic of the Congo 0.026567 0.402122 0.066 0.9473
## CountryDenmark 0.284055 0.404259 0.703 0.4823
## CountryDjibouti 0.379381 0.417740 0.908 0.3638
## CountryDominica -0.054182 0.416564 -0.130 0.8965
## CountryDominican Republic -0.025499 0.429149 -0.059 0.9526
## CountryEast Timor 1.059667 0.475582 2.228 0.0259 *
## CountryEgypt 0.029999 0.416076 0.072 0.9425
## CountryEl Salvador 0.167878 0.401753 0.418 0.6760
## CountryEquatorial Guinea 0.140472 0.437280 0.321 0.7480
## CountryEritrea 0.732428 0.463487 1.580 0.1140
## CountryEstonia -0.242958 0.429932 -0.565 0.5720
## CountryEthiopia 0.124850 0.415687 0.300 0.7639
## CountryFederated States of Micronesia 0.045999 0.425957 0.108 0.9140
## CountryFiji 0.751068 0.426241 1.762 0.0781 .
## CountryFinland 0.323784 0.421532 0.768 0.4424
## CountryFrance 0.035080 0.420941 0.083 0.9336
## CountryGabon -0.058534 0.432771 -0.135 0.8924
## CountryGeorgia 0.130716 0.420496 0.311 0.7559
## CountryGermany 0.316967 0.421435 0.752 0.4520
## CountryGhana 0.446371 0.426339 1.047 0.2951
## CountryGreece 0.248792 0.431598 0.576 0.5643
## CountryGreenland -0.168480 0.458253 -0.368 0.7131
## CountryGrenada -0.218457 0.408539 -0.535 0.5928
## CountryGuatemala 0.258926 0.409673 0.632 0.5274
## CountryGuinea 0.471191 0.431778 1.091 0.2751
## CountryGuinea-Bissau -0.082656 0.414910 -0.199 0.8421
## CountryHaiti -0.314074 0.433932 -0.724 0.4692
## CountryHonduras 0.057726 0.407433 0.142 0.8873
## CountryHungary -0.059738 0.421389 -0.142 0.8873
## CountryIceland 0.096666 0.413702 0.234 0.8152
## CountryIndia 0.728228 0.414469 1.757 0.0789 .
## CountryIndonesia -0.277066 0.429874 -0.645 0.5192
## CountryIran 0.384491 0.399658 0.962 0.3360
## CountryIraq 0.455195 0.426127 1.068 0.2854
## CountryIreland 0.544319 0.436287 1.248 0.2122
## CountryIsrael 0.037578 0.411631 0.091 0.9273
## CountryItaly 0.063933 0.447414 0.143 0.8864
## CountryJamaica -0.148747 0.442908 -0.336 0.7370
## CountryJapan 0.173508 0.418183 0.415 0.6782
## CountryJordan 0.683843 0.429160 1.593 0.1111
## CountryKazakhstan 0.820658 0.459458 1.786 0.0741 .
## CountryKenya 0.006786 0.409795 0.017 0.9868
## CountryKiribati -0.155606 0.398540 -0.390 0.6962
## CountryKosovo 0.069345 0.405588 0.171 0.8642
## CountryKuwait 0.001477 0.413771 0.004 0.9972
## CountryKyrgyzstan -0.173719 0.415843 -0.418 0.6761
## CountryLaos -0.023447 0.434714 -0.054 0.9570
## CountryLatvia 0.365470 0.452983 0.807 0.4198
## CountryLebanon 0.381957 0.430069 0.888 0.3745
## CountryLesotho 0.117259 0.411435 0.285 0.7756
## CountryLiberia -0.574866 0.448872 -1.281 0.2003
## CountryLibya 0.033509 0.426264 0.079 0.9373
## CountryLiechtenstein 0.047131 0.407579 0.116 0.9079
## CountryLithuania 0.230487 0.392228 0.588 0.5568
## CountryLuxembourg 0.263031 0.429109 0.613 0.5399
## CountryMacedonia -0.414528 0.456429 -0.908 0.3638
## CountryMadagascar -0.284271 0.422413 -0.673 0.5010
## CountryMalawi 0.230024 0.392408 0.586 0.5577
## CountryMalaysia -0.205752 0.432128 -0.476 0.6340
## CountryMaldives 0.240720 0.426174 0.565 0.5722
## CountryMali 0.390099 0.471056 0.828 0.4076
## CountryMalta 0.575517 0.450398 1.278 0.2013
## CountryMarshall Islands -0.141202 0.450143 -0.314 0.7538
## CountryMauritania 0.323197 0.426782 0.757 0.4489
## CountryMauritius 0.152995 0.413805 0.370 0.7116
## CountryMexico -0.052045 0.441578 -0.118 0.9062
## CountryMoldova -0.226280 0.396340 -0.571 0.5681
## CountryMonaco 0.446451 0.493736 0.904 0.3659
## CountryMongolia -0.010527 0.423529 -0.025 0.9802
## CountryMontenegro 0.196883 0.393598 0.500 0.6169
## CountryMorocco 0.317286 0.395735 0.802 0.4227
## CountryMozambique -0.233389 0.418945 -0.557 0.5775
## CountryMyanmar 0.019857 0.444523 0.045 0.9644
## CountryNamibia 0.269894 0.423680 0.637 0.5241
## CountryNauru 0.133485 0.431119 0.310 0.7568
## CountryNepal 0.392972 0.443393 0.886 0.3755
## CountryNetherlands 0.040905 0.437596 0.093 0.9255
## CountryNew Zealand 0.331040 0.410398 0.807 0.4199
## CountryNicaragua 0.437556 0.467507 0.936 0.3493
## CountryNiger -0.129800 0.417712 -0.311 0.7560
## CountryNigeria 0.018611 0.444288 0.042 0.9666
## CountryNorth Korea -0.123063 0.417743 -0.295 0.7683
## CountryNorway -0.108176 0.445508 -0.243 0.8081
## CountryOman 0.428251 0.422630 1.013 0.3109
## CountryPakistan 0.012131 0.437567 0.028 0.9779
## CountryPalau 0.423564 0.422609 1.002 0.3162
## CountryPanama 0.087359 0.428629 0.204 0.8385
## CountryPapua New Guinea 0.402853 0.450192 0.895 0.3709
## CountryPhilippines -0.147036 0.422679 -0.348 0.7279
## CountryPoland -0.227323 0.418935 -0.543 0.5874
## CountryPortugal 0.048500 0.411753 0.118 0.9062
## CountryQatar -0.114908 0.403596 -0.285 0.7759
## CountryRepublic of the Congo 0.259003 0.414234 0.625 0.5318
## CountryRomania 0.384472 0.430040 0.894 0.3713
## CountryRussia -0.095968 0.419570 -0.229 0.8191
## CountryRwanda 0.209427 0.407586 0.514 0.6074
## CountrySaint Kitts and Nevis -0.252535 0.407165 -0.620 0.5351
## CountrySaint Lucia 0.513987 0.464357 1.107 0.2683
## CountrySaint Vincent and the Grenadines 0.523002 0.445031 1.175 0.2399
## CountrySamoa 0.451333 0.410041 1.101 0.2710
## CountrySan Marino -0.642178 0.455789 -1.409 0.1589
## CountrySao Tome and Principe 0.441181 0.420887 1.048 0.2945
## CountrySaudi Arabia 0.403109 0.428044 0.942 0.3463
## CountrySenegal -0.119740 0.413182 -0.290 0.7720
## CountrySerbia -0.241054 0.424208 -0.568 0.5699
## CountrySeychelles 0.144282 0.395196 0.365 0.7150
## CountrySierra Leone 0.275383 0.435014 0.633 0.5267
## CountrySingapore 0.076476 0.440774 0.174 0.8623
## CountrySlovakia -0.254316 0.445008 -0.571 0.5677
## CountrySlovenia 0.101756 0.405870 0.251 0.8020
## CountrySolomon Islands 0.124176 0.437371 0.284 0.7765
## CountrySomalia 0.260288 0.424030 0.614 0.5393
## CountrySouth Africa 0.174715 0.428429 0.408 0.6834
## CountrySouth Korea 0.440281 0.415871 1.059 0.2897
## CountrySouth Sudan 0.234051 0.431689 0.542 0.5877
## CountrySpain 0.665990 0.430795 1.546 0.1221
## CountrySri Lanka 0.084400 0.418468 0.202 0.8402
## CountrySudan -0.105248 0.419239 -0.251 0.8018
## CountrySwaziland 0.159101 0.405604 0.392 0.6949
## CountrySweden -0.091905 0.453094 -0.203 0.8393
## CountrySwitzerland 0.574127 0.421118 1.363 0.1728
## CountrySyria 0.186420 0.455469 0.409 0.6823
## CountryTaiwan 0.011200 0.406048 0.028 0.9780
## CountryTajikistan 1.231747 0.525996 2.342 0.0192 *
## CountryTanzania 0.175299 0.423204 0.414 0.6787
## CountryThailand 0.407242 0.427935 0.952 0.3413
## CountryThe Bahamas -0.326681 0.434527 -0.752 0.4522
## CountryThe Gambia 0.171782 0.455625 0.377 0.7062
## CountryTogo -0.088768 0.414679 -0.214 0.8305
## CountryTonga 0.133220 0.451647 0.295 0.7680
## CountryTrinidad and Tobago -0.100891 0.424583 -0.238 0.8122
## CountryTunisia -0.287003 0.428000 -0.671 0.5025
## CountryTurkey -0.722408 0.464132 -1.556 0.1196
## CountryTurkmenistan -0.317598 0.421239 -0.754 0.4509
## CountryTuvalu 0.099008 0.405541 0.244 0.8071
## CountryUganda 0.217221 0.426129 0.510 0.6102
## CountryUkraine 0.074004 0.423130 0.175 0.8612
## CountryUnited Arab Emirates -0.509348 0.426431 -1.194 0.2323
## CountryUnited Kingdom 0.328439 0.388523 0.845 0.3979
## CountryUnited States of America 0.347281 0.410467 0.846 0.3975
## CountryUzbekistan 0.367002 0.430103 0.853 0.3935
## CountryVanuatu 0.227412 0.437481 0.520 0.6032
## CountryVatican City 0.250156 0.418610 0.598 0.5501
## CountryVietnam 0.161808 0.401676 0.403 0.6871
## CountryYemen 0.243398 0.451606 0.539 0.5899
## CountryZambia 0.523077 0.411595 1.271 0.2038
## CountryZimbabwe 0.728363 0.414733 1.756 0.0791 .
## `Item Type`Beverages 0.009793 0.113664 0.086 0.9313
## `Item Type`Cereal 0.071865 0.111969 0.642 0.5210
## `Item Type`Clothes -0.132251 0.111096 -1.190 0.2339
## `Item Type`Cosmetics -0.085303 0.111767 -0.763 0.4453
## `Item Type`Fruits 0.049436 0.113402 0.436 0.6629
## `Item Type`Household -0.074539 0.110025 -0.677 0.4981
## `Item Type`Meat -0.104381 0.112417 -0.929 0.3531
## `Item Type`Office Supplies -0.030833 0.110871 -0.278 0.7809
## `Item Type`Personal Care -0.011146 0.109706 -0.102 0.9191
## `Item Type`Snacks -0.081662 0.112331 -0.727 0.4672
## `Item Type`Vegetables -0.141253 0.111490 -1.267 0.2052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11110 on 8013 degrees of freedom
## Residual deviance: 10923 on 7818 degrees of freedom
## AIC: 11315
##
## Number of Fisher Scoring iterations: 4
# Prediction
p1 <- predict(mymodel, train, type = 'response')
head(p1)
## 1 2 3 4 5 6
## 0.5857078 0.5722450 0.4468595 0.4669937 0.5250446 0.4380831
The misclassification error on the training set is about 44% (an accuracy of roughly 56%), which is not great for a binary classifier.
# Misclassification error - train data
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(Predicted = pred1, Actual = train$`Sales Channel`)
tab1
## Actual
## Predicted Offline Online
## 0 2217 1728
## 1 1784 2285
1 - sum(diag(tab1))/sum(tab1)
## [1] 0.4382331
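Since caret is loaded, confusionMatrix() reports the same table plus accuracy, sensitivity, and specificity directly; the only wrinkle is mapping the 0/1 predictions back to the factor labels. A sketch:
# glm codes the second factor level ("Online") as 1, so p1 > 0.5 means "Online".
pred1_class <- factor(ifelse(p1 > 0.5, "Online", "Offline"),
                      levels = levels(train$`Sales Channel`))
confusionMatrix(pred1_class, train$`Sales Channel`, positive = "Offline")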
On the test set, the misclassification error actually worsens to about 51%, meaning the model does no better than a coin flip on unseen data.
# Misclassification error - test data
p2 <- predict(mymodel, test, type = 'response')
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(Predicted = pred2, Actual = test$`Sales Channel`)
tab2
## Actual
## Predicted Offline Online
## 0 458 533
## 1 480 515
1 - sum(diag(tab2))/sum(tab2)
## [1] 0.5100705
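A threshold-free view of the same result is the area under the ROC curve for the test-set probabilities. A sketch, assuming the pROC package (not loaded above) is installed:
library(pROC)
# AUC for the logistic model on the test set; a value near 0.5 indicates
# essentially no discriminating power.
roc_lr <- roc(response = test$`Sales Channel`, predictor = p2)
auc(roc_lr)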
Additionally, the likelihood-ratio goodness-of-fit test below yields a p-value of 0.66: the drop in deviance (11110 - 10923 = 187) on 8013 - 7818 = 195 degrees of freedom is not significant, suggesting that “Sales Channel” is independent of, or has no detectable relationship with, “Country” and “Item Type”.
# Goodness-of-fit test
with(mymodel, pchisq(null.deviance - deviance, df.null-df.residual, lower.tail = F))
## [1] 0.6597994
Now we will use the KNN model and see if our predictions can be improved.
# K-NN model
trControl <- trainControl(method = "repeatedcv", #repeated cross-validation
number = 3, # number of resampling iterations
classProbs = TRUE,
summaryFunction = twoClassSummary) # classProbs needed for ROC
set.seed(1234)
mymodel_knn <- train(`Sales Channel` ~ Country + `Item Type`,
                     data = train,
                     method = "knn",
                     metric = "ROC",
                     trControl = trControl)
As we can see below, the ROC AUC stays close to 0.5 for all values of k, with k = 9 selected. This suggests that this model has not improved our predictions much beyond the logistic regression model.
# Model performance
mymodel_knn
## k-Nearest Neighbors
##
## 8014 samples
## 2 predictor
## 2 classes: 'Offline', 'Online'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 1 times)
## Summary of sample sizes: 5342, 5343, 5343
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.4949696 0.5003738 0.4829309
## 7 0.4952805 0.4826272 0.5028699
## 9 0.4976715 0.4718764 0.5242975
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
In the following graph we can observe how the ROC AUC improves slightly as the number of neighbors increases from 5 to 9.
plot(mymodel_knn)
p3 <- predict(mymodel_knn, newdata = test )
The test-set accuracy is essentially the same as the logistic regression model’s:
confusionMatrix(p3, test$`Sales Channel`, positive = 'Offline' )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Offline Online
## Offline 449 514
## Online 489 534
##
## Accuracy : 0.495
## 95% CI : (0.4727, 0.5172)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : 0.9984
##
## Kappa : -0.0118
##
## Mcnemar's Test P-Value : 0.4486
##
## Sensitivity : 0.4787
## Specificity : 0.5095
## Pos Pred Value : 0.4663
## Neg Pred Value : 0.5220
## Prevalence : 0.4723
## Detection Rate : 0.2261
## Detection Prevalence : 0.4849
## Balanced Accuracy : 0.4941
##
## 'Positive' Class : Offline
##
Now we will use the larger data set to find out whether adding data to our model improves its predictive accuracy. I will keep using the logistic regression model, since the two models performed comparably on the test set and logistic regression is much faster to fit on 100,000 rows.
Our larger data set contains 100,000 observations and the same 14 variables as our previous smaller data set.
dim(sales_large)
## [1] 100000 14
We partition the data as before, using the “sales_prepared2” data set that we transformed earlier.
# Partition data - train (80%) & test (20%)
set.seed(12345)
ind <- sample(2, nrow(sales_prepared2), replace = T, prob = c(0.8, 0.2))
train2 <- sales_prepared2[ind==1,]
test2 <- sales_prepared2[ind==2,]
We run the logistic regression model below:
# Logistic regression model
mymodel2 <- glm(`Sales Channel` ~ Country + `Item Type`, data = train2, family = 'binomial')
# Prediction
p4 <- predict(mymodel2, train2, type = 'response')
head(p4)
## 1 2 3 4 5 6
## 0.5041427 0.4736320 0.5033300 0.4812581 0.5111808 0.4865881
The accuracy of our model does not seem to have improved with more data; the training misclassification error is about 48%.
# Misclassification error - train data
pred4 <- ifelse(p4>0.5, 1, 0)
tab4 <- table(Predicted = pred4, Actual = train2$`Sales Channel`)
tab4
## Actual
## Predicted Offline Online
## 0 19549 18076
## 1 20415 21958
1 - sum(diag(tab4))/sum(tab4)
## [1] 0.4811495
The test-set misclassification error comes in just under 50%, slightly better than the roughly 51% we saw with the smaller data set. However, we did not see a big change in accuracy when adding more data to our model.
# Misclassification error - test data
p5 <- predict(mymodel2, test2, type = 'response')
pred5 <- ifelse(p5>0.5, 1, 0)
tab5 <- table(Predicted = pred5, Actual = test2$`Sales Channel`)
tab5
## Actual
## Predicted Offline Online
## 0 4761 4735
## 1 5221 5285
1 - sum(diag(tab5))/sum(tab5)
## [1] 0.4977502
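Putting the two test-set errors side by side makes the comparison explicit:
# Test misclassification error: small vs. large data set.
data.frame(dataset    = c("10,000 records", "100,000 records"),
           test_error = c(1 - sum(diag(tab2)) / sum(tab2),
                          1 - sum(diag(tab5)) / sum(tab5)))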
After executing different models on different sizes of data sets, I have not been able to determine that a bigger data set improves the efficacy of a model. Given that the goodness-of-fit test already indicated no relationship between the predictors and “Sales Channel”, the near-50% accuracy is unsurprising: no amount of additional data helps when the signal is absent. Perhaps the data I started with were already large enough that adding more records made no difference. Additionally, the models have room for improvement if we consider some of the other predictor variables in the data. Predicting “Units Sold” from the other variables could also be an interesting problem to solve.