Introduction

This project was done alongside the Tableau assignment to practice some R programming and exploratory data analysis. The data set is the full Superstore data, containing sales, profit and other metrics for the chain's stores across the United States. We want to find out what drives profit and whether we can build a model to predict it.

Load the libraries used for this project; more will be added as the project progresses.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6      v purrr   0.3.5 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.4.1 
## v readr   2.1.3      v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggeasy)

Extract, transform and load the data

Read the Excel file containing the data we are going to explore.

df <- readxl::read_xlsx("Full-Sales-Superstore-Dataset.xlsx")
head(df)

Now that the data has been loaded into a data frame in R, we start by looking at the column names.

names(df)
##  [1] "Category"          "City"              "Country"          
##  [4] "Customer Name"     "Manufacturer"      "Order Date"       
##  [7] "Order ID"          "Postal Code"       "Product Name"     
## [10] "Region"            "Segment"           "Ship Date"        
## [13] "Ship Mode"         "State"             "Sub-Category"     
## [16] "Discount"          "Number of Records" "Profit"           
## [19] "Profit Ratio"      "Quantity"          "Sales"

To make the data easier to manipulate, convert the column names to lower case and replace any space or dash with an underscore.

names(df) <- tolower(names(df))
names(df) <- gsub(" ", "_", names(df))
names(df) <- sub("-", "_", names(df))
names(df)
##  [1] "category"          "city"              "country"          
##  [4] "customer_name"     "manufacturer"      "order_date"       
##  [7] "order_id"          "postal_code"       "product_name"     
## [10] "region"            "segment"           "ship_date"        
## [13] "ship_mode"         "state"             "sub_category"     
## [16] "discount"          "number_of_records" "profit"           
## [19] "profit_ratio"      "quantity"          "sales"

Convert the character columns to factors so they behave as categorical variables.

df <- df %>% mutate_if(is.character, as.factor)
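
mutate_if() still works but is superseded in dplyr 1.0+; the equivalent with across(), if you prefer the newer idiom:

# Same conversion using the across()/where() helpers
df <- df %>% mutate(across(where(is.character), as.factor))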

Get an in-depth look at the structure of the data.

str(df)
## tibble [9,994 x 21] (S3: tbl_df/tbl/data.frame)
##  $ category         : Factor w/ 3 levels "Furniture","Office Supplies",..: 2 2 2 2 2 2 2 1 2 2 ...
##  $ city             : Factor w/ 531 levels "Aberdeen","Abilene",..: 208 322 322 322 375 21 267 195 195 195 ...
##  $ country          : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
##  $ customer_name    : Factor w/ 793 levels "Aaron Bergman",..: 199 612 612 612 538 342 481 487 487 487 ...
##  $ manufacturer     : Factor w/ 174 levels "3D Systems","3M",..: 101 63 20 135 20 49 171 65 133 80 ...
##  $ order_date       : POSIXct[1:9994], format: "2011-01-04" "2011-01-05" ...
##  $ order_id         : Factor w/ 5009 levels "CA-2011-100006",..: 57 162 162 162 512 84 381 777 777 777 ...
##  $ postal_code      : num [1:9994] 77095 60540 60540 60540 19143 ...
##  $ product_name     : Factor w/ 1841 levels "\"While you Were Out\" Message Book, One Form per Page",..: 1086 708 217 1371 250 521 1821 742 1358 895 ...
##  $ region           : Factor w/ 4 levels "Central","East",..: 1 1 1 1 2 3 4 3 3 3 ...
##  $ segment          : Factor w/ 3 levels "Consumer","Corporate",..: 1 3 3 3 1 2 1 3 3 3 ...
##  $ ship_date        : POSIXct[1:9994], format: "2011-01-08" "2011-01-09" ...
##  $ ship_mode        : Factor w/ 4 levels "First Class",..: 4 4 4 4 4 1 3 4 4 4 ...
##  $ state            : Factor w/ 49 levels "Alabama","Arizona",..: 42 12 12 12 37 10 4 16 16 16 ...
##  $ sub_category     : Factor w/ 17 levels "Accessories",..: 13 4 11 15 3 3 13 6 3 4 ...
##  $ discount         : num [1:9994] 0.2 0.8 0.2 0.2 0.2 0 0 0 0 0 ...
##  $ number_of_records: num [1:9994] 1 1 1 1 1 1 1 1 1 1 ...
##  $ profit           : num [1:9994] 6 -5 4 -65 5 5 9 746 1 274 ...
##  $ profit_ratio     : num [1:9994] 0.34 -1.55 0.36 -0.24 0.25 0.41 0.48 0.29 0.27 0.45 ...
##  $ quantity         : num [1:9994] 2 2 3 3 3 3 3 9 2 2 ...
##  $ sales            : num [1:9994] 16 4 12 273 20 ...

Generate a summary of the data.

summary(df)
##             category               city               country    
##  Furniture      :2121   New York City: 915   United States:9994  
##  Office Supplies:6026   Los Angeles  : 747                       
##  Technology     :1847   Philadelphia : 537                       
##                         San Francisco: 510                       
##                         Seattle      : 428                       
##                         Houston      : 377                       
##                         (Other)      :6480                       
##              customer_name   manufacturer    order_date                 
##  William Brown      :  37   Other  :2074   Min.   :2011-01-04 00:00:00  
##  John Lee           :  34   Xerox  : 859   1st Qu.:2012-05-23 00:00:00  
##  Matt Abelman       :  34   Avery  : 557   Median :2013-06-27 00:00:00  
##  Paul Prost         :  34   GBC    : 332   Mean   :2013-04-30 19:20:02  
##  Chloris Kastensmidt:  32   Global : 291   3rd Qu.:2014-05-15 00:00:00  
##  Edward Hooks       :  32   Newell : 276   Max.   :2014-12-31 00:00:00  
##  (Other)            :9791   (Other):5605                                
##            order_id     postal_code   
##  CA-2014-100111:  14   Min.   : 1040  
##  CA-2014-157987:  12   1st Qu.:23223  
##  CA-2013-165330:  11   Median :56431  
##  US-2013-108504:  11   Mean   :55190  
##  CA-2012-131338:  10   3rd Qu.:90008  
##  CA-2013-105732:  10   Max.   :99301  
##  (Other)       :9926                  
##                                                     product_name 
##  Staples                                                  : 227  
##  Avery Non-Stick Binders                                  :  20  
##  KI Adjustable-Height Table                               :  18  
##  Storex Dura Pro Binders                                  :  17  
##  Logitech 910-002974 M325 Wireless Mouse for Web Scrolling:  15  
##  Situations Contoured Folding Chairs, 4/Set               :  15  
##  (Other)                                                  :9682  
##      region            segment       ship_date                  
##  Central:2323   Consumer   :5191   Min.   :2011-01-08 00:00:00  
##  East   :2848   Corporate  :3020   1st Qu.:2012-05-27 00:00:00  
##  South  :1620   Home Office:1783   Median :2013-06-30 00:00:00  
##  West   :3203                      Mean   :2013-05-04 18:20:49  
##                                    3rd Qu.:2014-05-19 00:00:00  
##                                    Max.   :2015-01-06 00:00:00  
##                                                                 
##           ship_mode             state           sub_category     discount     
##  First Class   :1538   California  :2001   Binders    :1523   Min.   :0.0000  
##  Same Day      : 543   New York    :1128   Paper      :1370   1st Qu.:0.0000  
##  Second Class  :1945   Texas       : 985   Furnishings: 957   Median :0.2000  
##  Standard Class:5968   Pennsylvania: 587   Phones     : 889   Mean   :0.1562  
##                        Washington  : 506   Storage    : 846   3rd Qu.:0.2000  
##                        Illinois    : 492   Art        : 796   Max.   :0.8000  
##                        (Other)     :4295   (Other)    :3613                   
##  number_of_records     profit          profit_ratio        quantity    
##  Min.   :1         Min.   :-6600.00   Min.   :-2.7500   Min.   : 1.00  
##  1st Qu.:1         1st Qu.:    2.00   1st Qu.: 0.0700   1st Qu.: 2.00  
##  Median :1         Median :    9.00   Median : 0.2700   Median : 3.00  
##  Mean   :1         Mean   :   28.65   Mean   : 0.1205   Mean   : 3.79  
##  3rd Qu.:1         3rd Qu.:   29.00   3rd Qu.: 0.3600   3rd Qu.: 5.00  
##  Max.   :1         Max.   : 8400.00   Max.   : 0.5000   Max.   :14.00  
##                                                                        
##      sales        
##  Min.   :    0.0  
##  1st Qu.:   17.0  
##  Median :   54.5  
##  Mean   :  229.9  
##  3rd Qu.:  210.0  
##  Max.   :22638.0  
## 

Explore the first six rows of the Superstore data set.

head(df)

Exploratory analysis

To start the exploratory analysis, we look at the sales of each sub-category.

df %>%
    group_by(sub_category) %>%
    summarise(total_sales = sum(sales)) %>%
    ggplot(aes(x = total_sales, y = reorder(sub_category, total_sales))) +
    geom_col(aes(fill = sub_category)) +
    theme_light() +
    labs(x = 'Sales', y = 'Sub-category', title = 'Sales by Sub-category') +
    theme(legend.position = 'none')

According to the data, Phones is the best-performing sub-category by total sales, while Fasteners is the worst.
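
To confirm this without squinting at the chart, sort the totals and pull out the two extremes:

df %>%
    group_by(sub_category) %>%
    summarise(total_sales = sum(sales)) %>%
    arrange(desc(total_sales)) %>%
    slice(c(1, n()))   # best and worst sub-categories by total sales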

Now we look at sales per category.

df %>%
    group_by(category) %>%
    summarise(total_sales = sum(sales)) %>%
    ggplot(aes(x = reorder(category, total_sales), y = total_sales)) +
    geom_col(aes(fill = category)) +
    theme_light() +
    labs(x = 'Category', y = 'Sales', title = 'Sales by Category') +
    theme(legend.position = 'none')

By category, Technology items performed the best.

Next, derive the order year so we can see how each category performed over time.

df <- df %>% mutate(order_year = year(order_date))
df %>%
    group_by(order_year, category) %>%
    summarise(total = sum(sales), .groups = 'drop') %>%
    ggplot(aes(x = order_year, y = total, color = category)) +
    geom_line(size = 1) +
    geom_point(size = 2) +
    theme_light() +
    labs(x = 'Year', y = 'Sales', title = 'Sales by Category per Year')

For most years, Technology outperformed the other categories.
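
The same comparison reads well as a table, with one column per year; a quick sketch using tidyr's pivot_wider():

df %>%
    group_by(order_year, category) %>%
    summarise(total = sum(sales), .groups = 'drop') %>%
    pivot_wider(names_from = order_year, values_from = total)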

Create a bubble chart to compare sales to profit.

library(ggrepel)
df %>%
    group_by(sub_category) %>%
    summarise(profit = sum(profit), sales = sum(sales)) %>%
    ggplot(aes(x = profit, y = sales)) +
    geom_point(aes(color = sub_category, size = profit * sales)) +
    geom_text_repel(aes(label = sub_category)) +
    theme_light() +
    theme(legend.position = 'none') +
    geom_vline(xintercept = 0, linetype = 'dashed') +
    labs(x = 'Profit', y = 'Sales', title = 'Bubble Chart of Sales and Profit')

Create a lollipop chart of profit.

df %>%
    group_by(sub_category) %>%
    summarise(profit = sum(profit)) %>%
    ggplot(aes(x = sub_category, y = profit)) +
    geom_segment(aes(x = sub_category, xend = sub_category, y = 0, yend = profit), color = "skyblue", size = 1) +
    geom_point(color = "blue", size = 6) +
    theme_light() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = 'dashed') +
    labs(x = 'Sub-category', y = 'Profit', title = 'Lollipop Chart of Profit')
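
If ranking matters more than alphabetical order, reorder() sorts the sticks by profit; a variant sketch of the same chart:

df %>%
    group_by(sub_category) %>%
    summarise(profit = sum(profit)) %>%
    ggplot(aes(x = reorder(sub_category, profit), y = profit)) +
    geom_segment(aes(xend = reorder(sub_category, profit), y = 0, yend = profit), color = "skyblue", size = 1) +
    geom_point(color = "blue", size = 6) +
    theme_light() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = 'dashed') +
    labs(x = 'Sub-category', y = 'Profit', title = 'Lollipop Chart of Profit (ordered)')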

A boxplot of sales for each category.

df %>% ggplot(aes(x = category, y = sales)) +
    geom_boxplot(aes(fill = category)) +
    theme_light() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    scale_y_continuous(limits = c(0, 2000)) +  # cap the y axis; extreme sales fall outside
    theme(legend.position = 'none')
## Warning: Removed 140 rows containing non-finite values (stat_boxplot).

Machine learning

We are going to build two different regression models to predict profit. The models are:

1. Boosted tree
2. Random forest

Select the features that will be used in the model.

df <- df %>% 
    select(profit, profit_ratio, sales, discount, sub_category, quantity, segment, category) %>%
    glimpse()
## Rows: 9,994
## Columns: 8
## $ profit       <dbl> 6, -5, 4, -65, 5, 5, 9, 746, 1, 274, 0, 3, 114, 204, -54,~
## $ profit_ratio <dbl> 0.34, -1.55, 0.36, -0.24, 0.25, 0.41, 0.48, 0.29, 0.27, 0~
## $ sales        <dbl> 16, 4, 12, 273, 20, 13, 19, 2574, 5, 610, 31, 7, 392, 756~
## $ discount     <dbl> 0.2, 0.8, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.~
## $ sub_category <fct> Paper, Binders, Labels, Storage, Art, Art, Paper, Chairs,~
## $ quantity     <dbl> 2, 2, 3, 3, 3, 3, 3, 9, 2, 2, 4, 1, 2, 4, 3, 7, 2, 3, 1, ~
## $ segment      <fct> Consumer, Home Office, Home Office, Home Office, Consumer~
## $ category     <fct> Office Supplies, Office Supplies, Office Supplies, Office~

Split the data into training and testing sets.

library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 1.0.0 --
## v broom        1.0.1     v rsample      1.1.0
## v dials        1.0.0     v tune         1.0.1
## v infer        1.0.3     v workflows    1.1.0
## v modeldata    1.0.1     v workflowsets 1.0.0
## v parsnip      1.0.2     v yardstick    1.1.0
## v recipes      1.0.2
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## * Learn how to get started at https://www.tidymodels.org/start/
set.seed(222)

data_split <- initial_split(df, prop = 0.6)
train_data <- training(data_split)
test_data <- testing(data_split)
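
A quick sanity check that the 60/40 split landed as expected:

nrow(train_data)   # should be about 60% of the 9,994 rows
nrow(test_data)    # the remaining 40%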

Create a recipe to preprocess the data for modelling.

store_rec <- recipe(profit ~., data = train_data) %>%
    step_center(all_numeric_predictors()) %>%
    step_scale(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors())
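
To preview what the recipe will feed the models, prep() it on the training data and bake() the result; a minimal sketch:

store_rec %>%
    prep(training = train_data) %>%
    bake(new_data = NULL) %>%   # new_data = NULL returns the processed training set
    glimpse()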

Build the first model, a boosted tree, using caret to choose the best hyperparameters.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(28)
tree_ctrl <- trainControl(method = 'repeatedcv', number = 10, allowParallel = TRUE)

library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(5)
registerDoParallel(cl)

# Boosted tree (method = "bstTree" from the bst package)
set.seed(158)
store_tree <- train(store_rec, train_data,
                    method = "bstTree", 
                    metric = 'Rsquared',
                    tuneLength = 5,
                    trControl = tree_ctrl)
## Loading required namespace: bst
stopCluster(cl)
store_tree
## Boosted Tree 
## 
## 5996 samples
##    7 predictor
## 
## Recipe steps: center, scale, dummy 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5396, 5397, 5397, 5396, 5396, 5396, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  mstop  RMSE      Rsquared   MAE     
##   1          50    212.7000  0.3161023  58.06256
##   1         100    212.6738  0.3149516  58.18301
##   1         150    213.2262  0.3112555  58.04859
##   1         200    213.6949  0.3090752  57.59776
##   1         250    214.5792  0.3061475  57.27373
##   2          50    143.1308  0.6313017  43.45589
##   2         100    120.5391  0.7293149  35.35751
##   2         150    109.8098  0.7792909  30.70317
##   2         200    103.9553  0.8144464  28.51209
##   2         250    101.1264  0.8340683  27.02751
##   3          50    123.9964  0.7207191  28.59222
##   3         100    104.5328  0.7884351  22.77343
##   3         150    103.4057  0.8061641  21.80901
##   3         200    104.0413  0.8103053  21.18649
##   3         250    106.0718  0.8056246  21.14407
##   4          50    126.6806  0.7067453  21.96145
##   4         100    116.3973  0.7373259  19.50782
##   4         150    113.2845  0.7505197  18.21935
##   4         200    111.1251  0.7601242  17.19606
##   4         250    109.3222  0.7743273  16.16433
##   5          50    131.0444  0.6883612  18.26342
##   5         100    125.6315  0.6935223  16.62587
##   5         150    124.2437  0.6940051  16.11937
##   5         200    121.9007  0.7056668  15.43525
##   5         250    120.7176  0.7162976  15.02885
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mstop = 250, maxdepth = 2 and nu = 0.1.
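
caret provides a ggplot() method for train objects, which makes it easy to see how performance changes across mstop and maxdepth; a quick sketch:

ggplot(store_tree, metric = 'Rsquared') +
    theme_light() +
    labs(title = 'Boosted tree tuning profile')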

Plot the variable importance to see which features are used by the model.

tree_imp <- varImp(store_tree, scale = TRUE)
ggplot(data = tree_imp) +
    theme_light() +
    labs(title = 'Variable Importance for the Boosted Tree') +
    geom_bar(stat = 'identity', fill = 'blue')

According to the model, Sales is the most important feature for predicting profit.

Build the second model, a random forest.

set.seed(45)
ranger_ctrl <- trainControl(method = "cv", number = 5, search = 'random', allowParallel = TRUE)

cl <- makeCluster(5)
registerDoParallel(cl)

# Random forest using Ranger
set.seed(456)
store_rf <- train(store_rec, train_data,
                 method = "ranger", 
                 metric = "Rsquared",
                 tuneLength = 10,
                 trControl = ranger_ctrl)
## Loading required namespace: e1071
## Loading required namespace: ranger
stopCluster(cl)

store_rf
## Random Forest 
## 
## 5996 samples
##    7 predictor
## 
## Recipe steps: center, scale, dummy 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4795, 4799, 4796, 4798, 4796 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  splitrule   RMSE      Rsquared   MAE     
##    3              5    extratrees  150.4929  0.7359519  33.54315
##    6             20    maxstat     135.1206  0.7238664  16.36335
##    8              9    maxstat     168.1805  0.5874670  28.72476
##    9             17    maxstat     142.9711  0.6899804  18.52979
##   11              5    variance    148.4411  0.7148978  29.22154
##   13             20    maxstat     143.1775  0.6859963  18.21757
##   14              2    extratrees  199.0185  0.4560838  51.04841
##   14             15    extratrees  113.7400  0.8335718  12.55911
##   15              6    variance    141.2858  0.7351351  24.87669
##   19              8    extratrees  141.9624  0.7354749  24.02120
## 
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 15, splitrule = extratrees
##  and min.node.size = 14.

Evaluate the models

Use the test data to generate predictions and evaluate each model.

bst_pred <- predict(store_tree, test_data)
postResample(bst_pred, obs = test_data$profit)
##       RMSE   Rsquared        MAE 
## 95.5340116  0.8184612 24.9234222

The boosted tree performed well, producing an \(R^2\) of roughly 0.82 on the test data.
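
postResample() is just a convenience wrapper; the same numbers can be computed by hand, which makes clear what each metric measures:

obs <- test_data$profit
sqrt(mean((bst_pred - obs)^2))   # RMSE
cor(bst_pred, obs)^2             # R-squared, the squared correlation caret reports
mean(abs(bst_pred - obs))        # MAE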

rf_pred <- predict(store_rf, test_data)
postResample(rf_pred, obs = test_data$profit)
##       RMSE   Rsquared        MAE 
## 93.5938946  0.8356959  9.8806871

The random forest performed better than the boosted tree, producing an \(R^2\) of roughly 0.84 on the test data.
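
A predicted-versus-actual plot gives a visual check on where the random forest struggles; a minimal sketch:

tibble(actual = test_data$profit, predicted = rf_pred) %>%
    ggplot(aes(x = actual, y = predicted)) +
    geom_point(alpha = 0.3) +
    geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
    theme_light() +
    labs(title = 'Random Forest: Predicted vs Actual Profit')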

Conclusion

Both models can predict profit reasonably well from the selected features, with the random forest slightly ahead of the boosted tree (\(R^2\) of 0.84 versus 0.82 on the test data).