Visit the following website and explore the range of sizes of this dataset (from 100 to 5 million records).
https://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/
Based on your computer’s capabilities (memory, CPU), select 2 files you can handle (recommended one small, one large)
I picked two files, one containing \(100\) records and the other containing \(100,000\) sales records.
Review the structure and content of the tables, and think which two machine learning algorithms presented so far could be used to analyze the data, and how can they be applied in the suggested environment of the datasets.
# load the packages used throughout (here for file paths, caret for modeling and plotting)
library(here)
library(caret)

# read in the two datasets
small_ds <- read.csv(here('Homework1','data','100_Sales_Records.csv'))
large_ds <- read.csv(here('Homework1','data','100000_Sales_Records.csv'))
dim(small_ds)
## [1] 100 14
The small dataset has \(100\) records with \(14\) columns each.
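Since both files are supposed to share the same layout, a quick sanity check (a sketch, output not shown) can confirm that the large dataset has the same columns and that neither file contains missing values:
# confirm the two files share the same structure and are complete
dim(large_ds)
identical(names(small_ds), names(large_ds))
sum(!complete.cases(small_ds)) + sum(!complete.cases(large_ds))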
# list types for each attribute
sapply(small_ds, class)
## Region Country Item.Type Sales.Channel Order.Priority
## "character" "character" "character" "character" "character"
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## "character" "integer" "character" "integer" "numeric"
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## "numeric" "numeric" "numeric" "numeric"
It is also always a good idea to actually eyeball your data.
# take a peek at the first 6 rows of the data
head(small_ds)
## Region Country Item.Type
## 1 Australia and Oceania Tuvalu Baby Food
## 2 Central America and the Caribbean Grenada Cereal
## 3 Europe Russia Office Supplies
## 4 Sub-Saharan Africa Sao Tome and Principe Fruits
## 5 Sub-Saharan Africa Rwanda Office Supplies
## 6 Australia and Oceania Solomon Islands Baby Food
## Sales.Channel Order.Priority Order.Date Order.ID Ship.Date Units.Sold
## 1 Offline H 5/28/2010 669165933 6/27/2010 9925
## 2 Online C 8/22/2012 963881480 9/15/2012 2804
## 3 Offline L 5/2/2014 341417157 5/8/2014 1779
## 4 Online C 6/20/2014 514321792 7/5/2014 8102
## 5 Offline L 2/1/2013 115456712 2/6/2013 5062
## 6 Online C 2/4/2015 547995746 2/21/2015 2974
## Unit.Price Unit.Cost Total.Revenue Total.Cost Total.Profit
## 1 255.28 159.42 2533654.00 1582243.50 951410.50
## 2 205.70 117.11 576782.80 328376.44 248406.36
## 3 651.21 524.96 1158502.59 933903.84 224598.75
## 4 9.33 6.92 75591.66 56065.84 19525.82
## 5 651.21 524.96 3296425.02 2657347.52 639077.50
## 6 255.28 159.42 759202.72 474115.08 285087.64
For this exercise, I will be looking to maximize profit. To that end, we add a factor variable categorizing Total.Profit as low, medium, or high.
Category <- cut(small_ds$Total.Profit, breaks = 3,
labels = c("low", "medium", "high"))
small_df <- data.frame(small_ds, Category)
# We need to perform the cut for the large ds as well
Category <- cut(large_ds$Total.Profit, breaks = 3,
labels = c("low", "medium", "high"))
large_df <- data.frame(large_ds, Category)
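One caveat worth noting: cut(..., breaks = 3) derives its break points from each dataset's own range of Total.Profit, so the low/medium/high thresholds are not exactly the same in the two data frames. A minimal sketch for inspecting the intervals actually used (output not shown):
# inspect the interval boundaries produced by cut() for each dataset
levels(cut(small_ds$Total.Profit, breaks = 3, dig.lab = 10))
levels(cut(large_ds$Total.Profit, breaks = 3, dig.lab = 10))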
Let’s now take a look at the number of instances (rows) that belong to each category. We can view this as an absolute count and as a percentage.
# summarize the category distribution as counts and percentages
percentage <- prop.table(table(small_df$Category)) * 100
cbind(freq=table(small_df$Category), percentage=percentage)
##        freq percentage
## low      69         69
## medium   20         20
## high     11         11
We can see that, for the small dataset, \(69\%\) of the entries fall into the low-profit category. Let’s perform the same analysis for the large dataset:
# summarize the category distribution as counts and percentages
percentage <- prop.table(table(large_df$Category)) * 100
cbind(freq=table(large_df$Category), percentage=percentage)
##         freq percentage
## low    75720     75.720
## medium 18271     18.271
## high    6009      6.009
In this case, the share of high-profit entries is even smaller.
Let’s also take a look at a summary of each attribute.
# summarize attribute distributions
summary(small_df)
## Region Country Item.Type Sales.Channel
## Length:100 Length:100 Length:100 Length:100
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:100 Length:100 Min. :114606559 Length:100
## Class :character Class :character 1st Qu.:338922488 Class :character
## Mode :character Mode :character Median :557708561 Mode :character
## Mean :555020412
## 3rd Qu.:790755081
## Max. :994022214
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 124 Min. : 9.33 Min. : 6.92 Min. : 4870
## 1st Qu.:2836 1st Qu.: 81.73 1st Qu.: 35.84 1st Qu.: 268721
## Median :5382 Median :179.88 Median :107.28 Median : 752314
## Mean :5129 Mean :276.76 Mean :191.05 Mean :1373488
## 3rd Qu.:7369 3rd Qu.:437.20 3rd Qu.:263.33 3rd Qu.:2212045
## Max. :9925 Max. :668.27 Max. :524.96 Max. :5997055
## Total.Cost Total.Profit Category
## Min. : 3612 Min. : 1258 low :69
## 1st Qu.: 168868 1st Qu.: 121444 medium:20
## Median : 363566 Median : 290768 high :11
## Mean : 931806 Mean : 441682
## 3rd Qu.:1613870 3rd Qu.: 635829
## Max. :4509794 Max. :1719922
We can see that the numerical variables have wide ranges. Our target variable, Total.Profit, goes from a minimum of about 1,300 dollars to a maximum of just over 1.7 million, with a median of a little over a quarter of a million dollars.
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
Univariate plots to better understand each attribute.
Multivariate plots to better understand the relationships between attributes.

Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.
It is helpful with visualization to have a way to refer to just the input attributes and just the output attribute. Let’s set that up and call the input attributes x and the output attribute (or class) y.
# split input and output
x <- small_df[,9:14]
y <- small_df[,15]
Given that the input variables are numeric, we can create box and whisker plots of each.
This gives us a much clearer idea of the distribution of the input attributes:
# boxplot for each attribute on one image
par(mfrow=c(1,6))
for(i in 9:14) {
idx <- (i - 8)
boxplot(x[,idx], main = names(small_df)[i])
}
This confirms what we learned in the last section: high-profit entries are the rarest transactions.
# barplot for class breakdown
plot(y)
We can also look at the interactions between the variables.
First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.
We can see some clear relationships between the input attributes (trends) and between attributes and the class values (ellipses):
# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")
Scatterplot Matrix of Small DataSet in R
# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")
Box and Whisker Plot of Small Dataset by Category
# density plots for each attribute by category value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
Density Plots of Small DataSet
Write a short essay explaining your selection. Then, select one of the 2 algorithms and explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that raise during the analysis. Test the code with both datasets selected and compare the results. Which result will you trust if you need to make a business decision? Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
#recode character variables to factors
small_df$Region <- as.factor(small_df$Region)
small_df$Item.Type <- as.factor(small_df$Item.Type)
small_df$Sales.Channel <- as.factor(small_df$Sales.Channel)
small_df$Order.Priority <- as.factor(small_df$Order.Priority)
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(small_df$Category, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- small_df[-validation_index,]
# use the remaining 80% of the data to train and test the models
small_df <- small_df[validation_index,]
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g., 95% accurate). We will use the metric variable when we build and evaluate each model next.
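As a reminder of what caret reports, accuracy can be computed by hand from any vector of predictions; a minimal sketch with made-up vectors (not taken from the models below):
# accuracy = number of correct predictions / total predictions * 100
predicted <- factor(c("low", "low", "medium", "high"), levels = c("low", "medium", "high"))
actual    <- factor(c("low", "medium", "medium", "high"), levels = c("low", "medium", "high"))
mean(predicted == actual) * 100  # 75 percent in this toy example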
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
Let’s evaluate 2 different algorithms:
Linear Discriminant Analysis (LDA)
Random Forest (RF)
We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This makes the results directly comparable.
Let’s build our two models:
# a) linear algorithms
set.seed(7)
fit.lda <- train(Category~ Order.Priority + Region + Item.Type + Unit.Cost, data=small_df, method="lda", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Category~Order.Priority + Region + Item.Type + Unit.Cost, data=small_df, method="rf", metric=metric, trControl=control)
The LDA model would not build with the full set of predictors (most likely because high-cardinality predictors such as Country produce more dummy variables than the 81 training rows can support), so I used the same subset of predictors for both models.
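To document the failure rather than hide it, the full-formula call (an assumption about what was originally attempted) can be wrapped in tryCatch so the error or warning message is captured instead of stopping the run; a sketch:
# attempt LDA with every predictor and capture the message if it fails
set.seed(7)
lda_all <- tryCatch(
  train(Category ~ ., data = small_df, method = "lda",
        metric = metric, trControl = control),
  error   = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)
lda_all  # either a fitted model or the captured message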
We now have 2 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
We can see the accuracy of each classifier and also other metrics like Kappa:
# summarize accuracy of models
results <- resamples(list(lda=fit.lda, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.125 0.5000000 0.6250000 0.6047178 0.7777778 0.8750000 1
## rf 0.500 0.5729167 0.6458333 0.6599206 0.7410714 0.8571429 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda -0.4358974 0.04545455 0.08571429 0.17996066 0.325 0.7419355 1
## rf -0.1851852 -0.09375000 0.00000000 -0.02492533 0.000 0.2222222 0
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
# compare accuracy of models
dotplot(results)
Based on the mean cross-validation accuracy, the most accurate model was Random Forest.
print(fit.rf)
## Random Forest
##
## 81 samples
## 4 predictor
## 3 classes: 'low', 'medium', 'high'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 72, 73, 73, 72, 74, 73, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6599206 -0.02492533
## 11 0.5529762 -0.02981102
## 21 0.5793651 0.04485915
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(fit.lda)
## Linear Discriminant Analysis
##
## 81 samples
## 4 predictor
## 3 classes: 'low', 'medium', 'high'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 72, 73, 73, 72, 74, 73, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6047178 0.1799607
# estimate the skill of the RF model on the validation dataset
predictions <- predict(fit.rf, validation)
confusionMatrix(predictions, validation$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 13 4 2
## medium 0 0 0
## high 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6842
## 95% CI : (0.4345, 0.8742)
## No Information Rate : 0.6842
## P-Value [Acc > NIR] : 0.6082
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 1.0000 0.0000 0.0000
## Specificity 0.0000 1.0000 1.0000
## Pos Pred Value 0.6842 NaN NaN
## Neg Pred Value NaN 0.7895 0.8947
## Prevalence 0.6842 0.2105 0.1053
## Detection Rate 0.6842 0.0000 0.0000
## Detection Prevalence 1.0000 0.0000 0.0000
## Balanced Accuracy 0.5000 0.5000 0.5000
We can see that the accuracy is about 68%, which is exactly the No Information Rate: on this small validation set (20% of the data, 19 rows) the model predicted every instance as low, which is also why Kappa is 0.
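The wide 95% confidence interval is a direct consequence of the small validation set; assuming caret’s usual exact binomial interval, it can be reproduced from the 13 correct predictions out of 19 (a sketch):
# exact binomial confidence interval for 13 correct predictions out of 19
binom.test(13, 19)$conf.int  # should roughly match the (0.4345, 0.8742) reported above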
What happens when we apply this model to the large dataset?
# apply the RF model trained on the small dataset to the entire large dataset
predict_large <- predict(fit.rf, large_df)
confusionMatrix(predict_large, large_df$Category)
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium high
## low 74828 17313 5174
## medium 191 217 167
## high 701 741 668
##
## Overall Statistics
##
## Accuracy : 0.7571
## 95% CI : (0.7545, 0.7598)
## No Information Rate : 0.7572
## P-Value [Acc > NIR] : 0.5223
##
## Kappa : 0.0688
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: low Class: medium Class: high
## Sensitivity 0.98822 0.01188 0.11117
## Specificity 0.07385 0.99562 0.98466
## Pos Pred Value 0.76893 0.37739 0.31659
## Neg Pred Value 0.66778 0.81842 0.94544
## Prevalence 0.75720 0.18271 0.06009
## Detection Rate 0.74828 0.00217 0.00668
## Detection Prevalence 0.97315 0.00575 0.02110
## Balanced Accuracy 0.53103 0.50375 0.54791
We achieved an accuracy of 76% on the large dataset, which suggests that Random Forest may be the better of the two algorithms for this data. That said, the accuracy is essentially equal to the No Information Rate (0.7572) and Kappa is only 0.07, so most of that accuracy comes from predicting the dominant low class.
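To make that comparison explicit, we can score a majority-class baseline that always predicts low; a minimal sketch:
# accuracy of a baseline that always predicts the majority class "low"
baseline <- factor(rep("low", nrow(large_df)), levels = levels(large_df$Category))
mean(baseline == large_df$Category)  # equals the No Information Rate, about 0.757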