Assignment 1

Visit the following website and explore the range of available sizes for this dataset (from 100 to 5 million records).

https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

File Selection

Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended: one small, one large).

I picked two files, one containing \(100\) records and the other containing \(100,000\) sales records.

Data review

Review the structure and content of the tables, and think about which two machine learning algorithms presented so far could be used to analyze the data, and how they could be applied in the suggested environment of the datasets.

library(here)   # file paths relative to the project root
library(caret)  # featurePlot, createDataPartition, train, confusionMatrix
small_ds <- read.csv(here('Homework1','data','100_Sales_Records.csv'))
large_ds <- read.csv(here('Homework1','data','100000_Sales_Records.csv'))

Let’s evaluate the Data

dim(small_ds)
## [1] 100  14

The small dataset has \(100\) records with \(14\) columns each.

Types of Attributes

# list types for each attribute
sapply(small_ds, class)
##         Region        Country      Item.Type  Sales.Channel Order.Priority 
##    "character"    "character"    "character"    "character"    "character" 
##     Order.Date       Order.ID      Ship.Date     Units.Sold     Unit.Price 
##    "character"      "integer"    "character"      "integer"      "numeric" 
##      Unit.Cost  Total.Revenue     Total.Cost   Total.Profit 
##      "numeric"      "numeric"      "numeric"      "numeric"

It is also always a good idea to actually eyeball your data.

# take a peek at the first 6 rows of the data
head(small_ds)
##                              Region               Country       Item.Type
## 1             Australia and Oceania                Tuvalu       Baby Food
## 2 Central America and the Caribbean               Grenada          Cereal
## 3                            Europe                Russia Office Supplies
## 4                Sub-Saharan Africa Sao Tome and Principe          Fruits
## 5                Sub-Saharan Africa                Rwanda Office Supplies
## 6             Australia and Oceania       Solomon Islands       Baby Food
##   Sales.Channel Order.Priority Order.Date  Order.ID Ship.Date Units.Sold
## 1       Offline              H  5/28/2010 669165933 6/27/2010       9925
## 2        Online              C  8/22/2012 963881480 9/15/2012       2804
## 3       Offline              L   5/2/2014 341417157  5/8/2014       1779
## 4        Online              C  6/20/2014 514321792  7/5/2014       8102
## 5       Offline              L   2/1/2013 115456712  2/6/2013       5062
## 6        Online              C   2/4/2015 547995746 2/21/2015       2974
##   Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1     255.28    159.42    2533654.00 1582243.50    951410.50
## 2     205.70    117.11     576782.80  328376.44    248406.36
## 3     651.21    524.96    1158502.59  933903.84    224598.75
## 4       9.33      6.92      75591.66   56065.84     19525.82
## 5     651.21    524.96    3296425.02 2657347.52    639077.50
## 6     255.28    159.42     759202.72  474115.08    285087.64

For this exercise, I will be looking to maximize profits. To that end, we will add a factor categorizing profit as low, medium, or high.

# bin Total.Profit into three equal-width categories
Category <- cut(small_ds$Total.Profit, breaks = 3,
                labels = c("low", "medium", "high"))
small_df <- data.frame(small_ds, Category)
# we need to perform the same cut for the large dataset as well
Category <- cut(large_ds$Total.Profit, breaks = 3,
                labels = c("low", "medium", "high"))
large_df <- data.frame(large_ds, Category)
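
One caveat worth flagging: cut(..., breaks = 3) builds equal-width bins from each dataset’s own range of Total.Profit, so “high” in the small dataset does not cover the same dollar range as “high” in the large one. A quick sketch to inspect the actual breakpoints:

# inspect the equal-width intervals cut() chose for each dataset;
# the thresholds differ because each is derived from its own data range
levels(cut(small_ds$Total.Profit, breaks = 3))
levels(cut(large_ds$Total.Profit, breaks = 3))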

Category Distribution

Let’s now take a look at the number of instances (rows) that belong to each category. We can view this as an absolute count and as a percentage.

# summarize the category distribution
percentage <- prop.table(table(small_df$Category)) * 100
cbind(freq=table(small_df$Category), percentage=percentage)
##        freq percentage
## low      69         69
## medium   20         20
## high     11         11

We can see that, for the small dataset, \(69\%\) of the entries fall into the low-profit category. Let’s perform the same analysis for the large dataset:

# summarize the category distribution
percentage <- prop.table(table(large_df$Category)) * 100
cbind(freq=table(large_df$Category), percentage=percentage)
##         freq percentage
## low    75720     75.720
## medium 18271     18.271
## high    6009      6.009

In the large dataset the imbalance is even more pronounced: high-profit entries account for only about \(6\%\) of the rows.
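
With roughly three quarters of the rows labeled low, an accuracy-driven classifier can score well by always predicting the majority class. If that becomes a problem later, caret’s downSample helper can rebalance the training data; the following is a minimal sketch, not something applied in this analysis:

# hedged sketch: down-sample the majority class so all classes have equal counts
# (downSample returns a data frame; the class column is named via yname)
balanced <- downSample(x = large_df[, setdiff(names(large_df), "Category")],
                       y = large_df$Category, yname = "Category")
table(balanced$Category)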

Statistical Summary

Let’s also take a look at a summary of each attribute.

# summarize attribute distributions
summary(small_df)
##     Region            Country           Item.Type         Sales.Channel     
##  Length:100         Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Order.Priority      Order.Date           Order.ID          Ship.Date        
##  Length:100         Length:100         Min.   :114606559   Length:100        
##  Class :character   Class :character   1st Qu.:338922488   Class :character  
##  Mode  :character   Mode  :character   Median :557708561   Mode  :character  
##                                        Mean   :555020412                     
##                                        3rd Qu.:790755081                     
##                                        Max.   :994022214                     
##    Units.Sold     Unit.Price       Unit.Cost      Total.Revenue    
##  Min.   : 124   Min.   :  9.33   Min.   :  6.92   Min.   :   4870  
##  1st Qu.:2836   1st Qu.: 81.73   1st Qu.: 35.84   1st Qu.: 268721  
##  Median :5382   Median :179.88   Median :107.28   Median : 752314  
##  Mean   :5129   Mean   :276.76   Mean   :191.05   Mean   :1373488  
##  3rd Qu.:7369   3rd Qu.:437.20   3rd Qu.:263.33   3rd Qu.:2212045  
##  Max.   :9925   Max.   :668.27   Max.   :524.96   Max.   :5997055  
##    Total.Cost       Total.Profit       Category 
##  Min.   :   3612   Min.   :   1258   low   :69  
##  1st Qu.: 168868   1st Qu.: 121444   medium:20  
##  Median : 363566   Median : 290768   high  :11  
##  Mean   : 931806   Mean   : 441682              
##  3rd Qu.:1613870   3rd Qu.: 635829              
##  Max.   :4509794   Max.   :1719922

We can see that the numerical values have wide ranges. Our target variable, Total.Profit, runs from $1,258 to almost $1.72 million, with a median just over $290,000.
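
Wide, skewed ranges like these can matter for covariance-based methods such as LDA. A minimal sketch of standardizing the numeric columns to z-scores, shown purely for illustration and not applied in the analysis below:

# hedged sketch: center and scale the numeric columns so they are comparable
num_cols <- c("Units.Sold", "Unit.Price", "Unit.Cost",
              "Total.Revenue", "Total.Cost", "Total.Profit")
scaled_small <- small_df
scaled_small[num_cols] <- scale(small_df[num_cols])  # z-scores per column
summary(scaled_small[num_cols])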

Visualize Dataset

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

Univariate plots, to better understand each attribute.
Multivariate plots, to better understand the relationships between attributes.

Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.

# split input and output
x <- small_df[,9:14]
y <- small_df[,15]

Given that the input variables are numeric, we can create box and whisker plots of each.

This gives us a much clearer idea of the distribution of the input attributes:

# boxplot for each attribute on one image
par(mfrow=c(1,6))
for(i in 1:6) {
  boxplot(x[,i], main = names(x)[i])
}

We can also create a barplot of the class breakdown. This confirms what we learned in the last section: high-profit entries account for the smallest share of the transactions.

# barplot for class breakdown
plot(y)

Multivariate Plots

Next we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.

We can see some clear relationships between the input attributes (trends) and between attributes and the class values (ellipses):

# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")

Scatterplot Matrix of the Small Dataset

# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")

Box and Whisker Plots of the Small Dataset by Category Value

# density plots for each attribute by category value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)

Density Plots of the Small Dataset

Algorithm Selection

Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

#recode character variables to factors
small_df$Region <- as.factor(small_df$Region)
small_df$Item.Type <- as.factor(small_df$Item.Type)
small_df$Sales.Channel <- as.factor(small_df$Sales.Channel)
small_df$Order.Priority <- as.factor(small_df$Order.Priority)


# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(small_df$Category, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- small_df[-validation_index,]
# use the remaining 80% of the data for training and testing the models
small_df <- small_df[validation_index,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

We are using the metric of “Accuracy” to evaluate models. This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the metric variable when we build and evaluate each model next.
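
As a concrete illustration, accuracy reduces to a one-liner in R. The preds and truth vectors below are made-up placeholders, not real model output:

# hedged sketch: accuracy = correct predictions / total predictions * 100
preds <- factor(c("low", "low", "medium", "high"),
                levels = c("low", "medium", "high"))
truth <- factor(c("low", "medium", "medium", "high"),
                levels = c("low", "medium", "high"))
mean(preds == truth) * 100  # 75: three of the four placeholder predictions match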

Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 2 different algorithms:

Linear Discriminant Analysis (LDA)
Random Forest (RF)

We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This makes the results directly comparable.

Let’s build our two models:

# a) linear algorithm: LDA
set.seed(7)
fit.lda <- train(Category ~ Order.Priority + Region + Item.Type + Unit.Cost,
                 data=small_df, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithm: Random Forest
set.seed(7)
fit.rf <- train(Category ~ Order.Priority + Region + Item.Type + Unit.Cost,
                data=small_df, method="rf", metric=metric, trControl=control)

The LDA model failed to build with the full set of predictors, so I restricted both models to a subset (Order.Priority, Region, Item.Type, and Unit.Cost).
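
My assumption is that the failure comes from collinearity: Total.Revenue, Total.Cost, and Total.Profit are exact arithmetic combinations of Units.Sold, Unit.Price, and Unit.Cost (and Category itself was derived from Total.Profit), which can make the covariance matrices LDA relies on (near-)singular. A quick check of that assumption:

# hedged check: the totals are products/differences of the unit columns,
# so these predictors are highly correlated and close to rank-deficient
cor(small_df[, c("Units.Sold", "Unit.Price", "Unit.Cost",
                 "Total.Revenue", "Total.Cost", "Total.Profit")])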

Select Best Model

We now have 2 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report on the accuracy of each model by first creating a list of the fitted models and passing it to the summary function.

We can see the accuracy of each classifier and also other metrics like Kappa:

# summarize accuracy of models
results <- resamples(list(lda=fit.lda, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, rf 
## Number of resamples: 10 
## 
## Accuracy 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda 0.125 0.5000000 0.6250000 0.6047178 0.7777778 0.8750000    1
## rf  0.500 0.5729167 0.6458333 0.6599206 0.7410714 0.8571429    0
## 
## Kappa 
##           Min.     1st Qu.     Median        Mean 3rd Qu.      Max. NA's
## lda -0.4358974  0.04545455 0.08571429  0.17996066   0.325 0.7419355    1
## rf  -0.1851852 -0.09375000 0.00000000 -0.02492533   0.000 0.2222222    0

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).

# compare accuracy of models
dotplot(results)

It appears the most accurate model was Random Forest.

print(fit.rf)
## Random Forest 
## 
## 81 samples
##  4 predictor
##  3 classes: 'low', 'medium', 'high' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 72, 73, 73, 72, 74, 73, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      
##    2    0.6599206  -0.02492533
##   11    0.5529762  -0.02981102
##   21    0.5793651   0.04485915
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(fit.lda)
## Linear Discriminant Analysis 
## 
## 81 samples
##  4 predictor
##  3 classes: 'low', 'medium', 'high' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 72, 73, 73, 72, 74, 73, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6047178  0.1799607

Make Predictions

# estimate skill of the RF model on the validation dataset
predictions <- predict(fit.rf, validation)
confusionMatrix(predictions, validation$Category)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low medium high
##     low     13      4    2
##     medium   0      0    0
##     high     0      0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6842          
##                  95% CI : (0.4345, 0.8742)
##     No Information Rate : 0.6842          
##     P-Value [Acc > NIR] : 0.6082          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: medium Class: high
## Sensitivity              1.0000        0.0000      0.0000
## Specificity              0.0000        1.0000      1.0000
## Pos Pred Value           0.6842           NaN         NaN
## Neg Pred Value              NaN        0.7895      0.8947
## Prevalence               0.6842        0.2105      0.1053
## Detection Rate           0.6842        0.0000      0.0000
## Detection Prevalence     1.0000        0.0000      0.0000
## Balanced Accuracy        0.5000        0.5000      0.5000

We can see that the accuracy is 68.4%, but the confusion matrix shows the model predicted low for every validation row: the accuracy exactly matches the no-information rate, and Kappa is 0. The validation dataset was also very small (20% of 100 rows).
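
A useful sanity check here is the no-information rate, i.e. the accuracy of always predicting the majority class. A short sketch computing it from the validation labels:

# hedged sketch: baseline accuracy from always predicting the majority class;
# a model that only matches this baseline has learned no real signal
max(prop.table(table(validation$Category)))  # equals the reported 0.6842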

What happens when we apply this model to the large dataset?

# apply the RF model to the entire large dataset
predict_large <- predict(fit.rf, large_df)
confusionMatrix(predict_large, large_df$Category)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   low medium  high
##     low    74828  17313  5174
##     medium   191    217   167
##     high     701    741   668
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7571          
##                  95% CI : (0.7545, 0.7598)
##     No Information Rate : 0.7572          
##     P-Value [Acc > NIR] : 0.5223          
##                                           
##                   Kappa : 0.0688          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: low Class: medium Class: high
## Sensitivity             0.98822       0.01188     0.11117
## Specificity             0.07385       0.99562     0.98466
## Pos Pred Value          0.76893       0.37739     0.31659
## Neg Pred Value          0.66778       0.81842     0.94544
## Prevalence              0.75720       0.18271     0.06009
## Detection Rate          0.74828       0.00217     0.00668
## Detection Prevalence    0.97315       0.00575     0.02110
## Balanced Accuracy       0.53103       0.50375     0.54791

Conclusion

We achieved an accuracy of 75.7% on the large dataset, which suggests that Random Forest might be the better of the two algorithms for this data. That said, the result deserves caution: the accuracy (0.7571) is essentially identical to the no-information rate (0.7572), and the Kappa of 0.0688 shows the model does little more than predict the majority low class. If I had to make a business decision, I would trust the evaluation on the large dataset over the tiny validation set, whose 95% confidence interval spans 0.43 to 0.87, but I would first address the class imbalance before relying on either model.
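
To make that caveat concrete, per-class sensitivity is more informative than overall accuracy here; caret exposes it on the confusionMatrix object, as in this short sketch:

# hedged sketch: per-class sensitivity makes the imbalance problem explicit;
# the model almost never detects medium- or high-profit orders
cm_large <- confusionMatrix(predict_large, large_df$Category)
cm_large$byClass[, "Sensitivity"]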