This project was done alongside the Tableau assignment in order to practice some R programming with exploratory data analysis. The data set used is the full superstore data containing information on there sales, profit and other metrics of their stores around the United States of America. We want to find out what impacts their profits and if we can build a model to predict profit.
Libraries used for our little project. Although as we move the project will add more libraries.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.5
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.2
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggeasy)
## Warning: package 'ggeasy' was built under R version 4.1.3
Read the excel file containing the data that we are going to explore.
df <- readxl::read_xlsx("Full-Sales-Superstore-Dataset.xlsx")
head(df)
Now that the data has been moved into a dataframe in R, we shall start by looking at the column names.
names(df)
## [1] "Category" "City" "Country"
## [4] "Customer Name" "Manufacturer" "Order Date"
## [7] "Order ID" "Postal Code" "Product Name"
## [10] "Region" "Segment" "Ship Date"
## [13] "Ship Mode" "State" "Sub-Category"
## [16] "Discount" "Number of Records" "Profit"
## [19] "Profit Ratio" "Quantity" "Sales"
To make it easier to manipulate the data in the dataframe, I will have to change the columns name into lower case and introduce an underscore where a space or dash is present.
names(df) <- tolower(names(df))
names(df) <- gsub(" ","_", names(df),)
names(df) <- sub("-","_", names(df),)
names(df)
## [1] "category" "city" "country"
## [4] "customer_name" "manufacturer" "order_date"
## [7] "order_id" "postal_code" "product_name"
## [10] "region" "segment" "ship_date"
## [13] "ship_mode" "state" "sub_category"
## [16] "discount" "number_of_records" "profit"
## [19] "profit_ratio" "quantity" "sales"
Change some of the data types of the data.
df <- df %>% mutate_if(is.character, as.factor)
Get an in depth look at the data.
str(df)
## tibble [9,994 x 21] (S3: tbl_df/tbl/data.frame)
## $ category : Factor w/ 3 levels "Furniture","Office Supplies",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ city : Factor w/ 531 levels "Aberdeen","Abilene",..: 208 322 322 322 375 21 267 195 195 195 ...
## $ country : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
## $ customer_name : Factor w/ 793 levels "Aaron Bergman",..: 199 612 612 612 538 342 481 487 487 487 ...
## $ manufacturer : Factor w/ 174 levels "3D Systems","3M",..: 101 63 20 135 20 49 171 65 133 80 ...
## $ order_date : POSIXct[1:9994], format: "2011-01-04" "2011-01-05" ...
## $ order_id : Factor w/ 5009 levels "CA-2011-100006",..: 57 162 162 162 512 84 381 777 777 777 ...
## $ postal_code : num [1:9994] 77095 60540 60540 60540 19143 ...
## $ product_name : Factor w/ 1841 levels "\"While you Were Out\" Message Book, One Form per Page",..: 1086 708 217 1371 250 521 1821 742 1358 895 ...
## $ region : Factor w/ 4 levels "Central","East",..: 1 1 1 1 2 3 4 3 3 3 ...
## $ segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 3 3 3 1 2 1 3 3 3 ...
## $ ship_date : POSIXct[1:9994], format: "2011-01-08" "2011-01-09" ...
## $ ship_mode : Factor w/ 4 levels "First Class",..: 4 4 4 4 4 1 3 4 4 4 ...
## $ state : Factor w/ 49 levels "Alabama","Arizona",..: 42 12 12 12 37 10 4 16 16 16 ...
## $ sub_category : Factor w/ 17 levels "Accessories",..: 13 4 11 15 3 3 13 6 3 4 ...
## $ discount : num [1:9994] 0.2 0.8 0.2 0.2 0.2 0 0 0 0 0 ...
## $ number_of_records: num [1:9994] 1 1 1 1 1 1 1 1 1 1 ...
## $ profit : num [1:9994] 6 -5 4 -65 5 5 9 746 1 274 ...
## $ profit_ratio : num [1:9994] 0.34 -1.55 0.36 -0.24 0.25 0.41 0.48 0.29 0.27 0.45 ...
## $ quantity : num [1:9994] 2 2 3 3 3 3 3 9 2 2 ...
## $ sales : num [1:9994] 16 4 12 273 20 ...
Generate a summary of the data.
summary(df)
## category city country
## Furniture :2121 New York City: 915 United States:9994
## Office Supplies:6026 Los Angeles : 747
## Technology :1847 Philadelphia : 537
## San Francisco: 510
## Seattle : 428
## Houston : 377
## (Other) :6480
## customer_name manufacturer order_date
## William Brown : 37 Other :2074 Min. :2011-01-04 00:00:00
## John Lee : 34 Xerox : 859 1st Qu.:2012-05-23 00:00:00
## Matt Abelman : 34 Avery : 557 Median :2013-06-27 00:00:00
## Paul Prost : 34 GBC : 332 Mean :2013-04-30 19:20:02
## Chloris Kastensmidt: 32 Global : 291 3rd Qu.:2014-05-15 00:00:00
## Edward Hooks : 32 Newell : 276 Max. :2014-12-31 00:00:00
## (Other) :9791 (Other):5605
## order_id postal_code
## CA-2014-100111: 14 Min. : 1040
## CA-2014-157987: 12 1st Qu.:23223
## CA-2013-165330: 11 Median :56431
## US-2013-108504: 11 Mean :55190
## CA-2012-131338: 10 3rd Qu.:90008
## CA-2013-105732: 10 Max. :99301
## (Other) :9926
## product_name
## Staples : 227
## Avery Non-Stick Binders : 20
## KI Adjustable-Height Table : 18
## Storex Dura Pro Binders : 17
## Logitech 910-002974 M325 Wireless Mouse for Web Scrolling: 15
## Situations Contoured Folding Chairs, 4/Set : 15
## (Other) :9682
## region segment ship_date
## Central:2323 Consumer :5191 Min. :2011-01-08 00:00:00
## East :2848 Corporate :3020 1st Qu.:2012-05-27 00:00:00
## South :1620 Home Office:1783 Median :2013-06-30 00:00:00
## West :3203 Mean :2013-05-04 18:20:49
## 3rd Qu.:2014-05-19 00:00:00
## Max. :2015-01-06 00:00:00
##
## ship_mode state sub_category discount
## First Class :1538 California :2001 Binders :1523 Min. :0.0000
## Same Day : 543 New York :1128 Paper :1370 1st Qu.:0.0000
## Second Class :1945 Texas : 985 Furnishings: 957 Median :0.2000
## Standard Class:5968 Pennsylvania: 587 Phones : 889 Mean :0.1562
## Washington : 506 Storage : 846 3rd Qu.:0.2000
## Illinois : 492 Art : 796 Max. :0.8000
## (Other) :4295 (Other) :3613
## number_of_records profit profit_ratio quantity
## Min. :1 Min. :-6600.00 Min. :-2.7500 Min. : 1.00
## 1st Qu.:1 1st Qu.: 2.00 1st Qu.: 0.0700 1st Qu.: 2.00
## Median :1 Median : 9.00 Median : 0.2700 Median : 3.00
## Mean :1 Mean : 28.65 Mean : 0.1205 Mean : 3.79
## 3rd Qu.:1 3rd Qu.: 29.00 3rd Qu.: 0.3600 3rd Qu.: 5.00
## Max. :1 Max. : 8400.00 Max. : 0.5000 Max. :14.00
##
## sales
## Min. : 0.0
## 1st Qu.: 17.0
## Median : 54.5
## Mean : 229.9
## 3rd Qu.: 210.0
## Max. :22638.0
##
Explore the first 5 rows of the superstore data set.
head(df)
To start the exploratory analysis, we look at the sales of each sub-category.
df %>% group_by(sub_category) %>% summarise(total_sales = sum(sales)) %>% ggplot(aes(x = total_sales/100, y = reorder(sub_category, total_sales ))) +
geom_bar(stat = 'identity', aes(fill = sub_category)) +
theme_light() +
labs(x = 'Sales',y = 'Sub-category', title = 'Sales by Sub-category') +
theme(legend.position = 'none')
According to the data set, Phones are the best performing sub-category item while Fasteners are the worst.
Now we look at sales per category.
df %>% group_by(category) %>% summarise(total_sales = sum(sales)) %>% ggplot(aes(x = reorder(category, total_sales), y = total_sales)) +
geom_bar(stat = 'identity', aes(fill = category)) +
theme_light() +
labs(x = 'Category',y = 'Sales', title = 'Sales by Category') +
theme(legend.position = 'none')
By category, Technology items performed the best.
We will use the order year to see throughout the years how each category performed.
df <- df %>% mutate(order_year = year(order_date))
df %>% group_by(order_year, category) %>% summarise(total = sum(sales)) %>% ggplot(aes(x = order_year, y = total, color = category)) +
geom_line(size = 1) +
geom_point(size = 2) +
theme_light() +
labs(y = 'Sales',x = 'Category', title = 'Sales of Category by year')
## `summarise()` has grouped output by 'order_year'. You can override using the
## `.groups` argument.
For most years, Technology outperformed the other categories.
Create a bubble chart to compare sales to profit.
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.1.3
df %>% group_by(sub_category) %>% summarise(profit = sum(profit), sales = sum(sales)) %>% ggplot(aes(x = profit, y = sales)) +
geom_point(aes(color = sub_category, size = profit*sales)) +
geom_text_repel(aes(label = sub_category)) +
theme_light() +
theme(legend.position = 'none') +
geom_vline(xintercept = 0, linetype = 'dashed') +
labs(y = 'Sales',x = 'Profit', title = 'Bubble Chart of Sales and Profit')
Create a lollipop chart of profit.
df %>% group_by(sub_category) %>% summarise(profit = sum(profit)) %>% ggplot(aes(x = sub_category, y = profit)) +
geom_segment( aes(x = sub_category, xend = sub_category, y = 0, yend = profit), color="skyblue", size = 1) +
geom_point( color="blue", size=6) +
theme_light() +
coord_flip() +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(y = 'Profit',x = 'Sub-category', title = 'Lollipop Chart of Profit')
A boxplot of the categorical items.
df %>% ggplot(aes(x = category, y = sales)) +
geom_boxplot(aes(fill = category)) +
theme_light() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(limit = c(0, 2000)) +
theme(legend.position = 'none')
## Warning: Removed 140 rows containing non-finite values (stat_boxplot).
We are going to use two different model to create a regression model to predict Profit. The models are: 1. Boosted tree 2. Random forest
Select the features that will be used in the model.
df <- df %>%
select(profit, profit_ratio, sales, discount, sub_category, quantity, segment, category) %>%
glimpse()
## Rows: 9,994
## Columns: 8
## $ profit <dbl> 6, -5, 4, -65, 5, 5, 9, 746, 1, 274, 0, 3, 114, 204, -54,~
## $ profit_ratio <dbl> 0.34, -1.55, 0.36, -0.24, 0.25, 0.41, 0.48, 0.29, 0.27, 0~
## $ sales <dbl> 16, 4, 12, 273, 20, 13, 19, 2574, 5, 610, 31, 7, 392, 756~
## $ discount <dbl> 0.2, 0.8, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.~
## $ sub_category <fct> Paper, Binders, Labels, Storage, Art, Art, Paper, Chairs,~
## $ quantity <dbl> 2, 2, 3, 3, 3, 3, 3, 9, 2, 2, 4, 1, 2, 4, 3, 7, 2, 3, 1, ~
## $ segment <fct> Consumer, Home Office, Home Office, Home Office, Consumer~
## $ category <fct> Office Supplies, Office Supplies, Office Supplies, Office~
Split the data into training and testing sets.
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.1.3
## -- Attaching packages -------------------------------------- tidymodels 1.0.0 --
## v broom 1.0.1 v rsample 1.1.0
## v dials 1.0.0 v tune 1.0.1
## v infer 1.0.3 v workflows 1.1.0
## v modeldata 1.0.1 v workflowsets 1.0.0
## v parsnip 1.0.2 v yardstick 1.1.0
## v recipes 1.0.2
## Warning: package 'broom' was built under R version 4.1.3
## Warning: package 'dials' was built under R version 4.1.3
## Warning: package 'scales' was built under R version 4.1.3
## Warning: package 'infer' was built under R version 4.1.3
## Warning: package 'modeldata' was built under R version 4.1.3
## Warning: package 'parsnip' was built under R version 4.1.3
## Warning: package 'rsample' was built under R version 4.1.3
## Warning: package 'tune' was built under R version 4.1.3
## Warning: package 'workflows' was built under R version 4.1.3
## Warning: package 'workflowsets' was built under R version 4.1.3
## Warning: package 'yardstick' was built under R version 4.1.3
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Learn how to get started at https://www.tidymodels.org/start/
set.seed(222)
data_split <- initial_split(df, prop = 0.6)
train_data <- training(data_split)
test_data <- testing(data_split)
Create a recipe to preprocess the data for modelling.
store_rec <- recipe(profit ~., data = train_data) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
Build the first model Boosted tree using caret to choose
best hyperparameters.
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.1.1
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
set.seed(28)
tree_ctrl <- trainControl(method = 'repeatedcv', number = 10, allowParallel = TRUE)
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.1.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.1.3
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.1.3
## Loading required package: parallel
cl <- makeCluster(5)
registerDoParallel(cl)
# Boosted Generalized Linear Model
set.seed(158)
store_tree <- train(store_rec, train_data,
method = "bstTree",
metric = 'Rsquared',
tuneLength = 5,
trControl = tree_ctrl)
## Loading required namespace: bst
stopCluster(cl)
store_tree
## Boosted Tree
##
## 5996 samples
## 7 predictor
##
## Recipe steps: center, scale, dummy
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 5396, 5397, 5397, 5396, 5396, 5396, ...
## Resampling results across tuning parameters:
##
## maxdepth mstop RMSE Rsquared MAE
## 1 50 212.7000 0.3161023 58.06256
## 1 100 212.6738 0.3149516 58.18301
## 1 150 213.2262 0.3112555 58.04859
## 1 200 213.6949 0.3090752 57.59776
## 1 250 214.5792 0.3061475 57.27373
## 2 50 143.1308 0.6313017 43.45589
## 2 100 120.5391 0.7293149 35.35751
## 2 150 109.8098 0.7792909 30.70317
## 2 200 103.9553 0.8144464 28.51209
## 2 250 101.1264 0.8340683 27.02751
## 3 50 123.9964 0.7207191 28.59222
## 3 100 104.5328 0.7884351 22.77343
## 3 150 103.4057 0.8061641 21.80901
## 3 200 104.0413 0.8103053 21.18649
## 3 250 106.0718 0.8056246 21.14407
## 4 50 126.6806 0.7067453 21.96145
## 4 100 116.3973 0.7373259 19.50782
## 4 150 113.2845 0.7505197 18.21935
## 4 200 111.1251 0.7601242 17.19606
## 4 250 109.3222 0.7743273 16.16433
## 5 50 131.0444 0.6883612 18.26342
## 5 100 125.6315 0.6935223 16.62587
## 5 150 124.2437 0.6940051 16.11937
## 5 200 121.9007 0.7056668 15.43525
## 5 250 120.7176 0.7162976 15.02885
##
## Tuning parameter 'nu' was held constant at a value of 0.1
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mstop = 250, maxdepth = 2 and nu = 0.1.
Plot the variable importance to see which features are used by the model.
ctree_imp <- varImp(store_tree, scale = TRUE)
ggplot(data = ctree_imp) + theme_light() + labs(title = 'Variable importance using Conditional tree') + geom_bar(stat = 'identity', fill = 'blue')
From the data, Sales has been selected as the most important feature for
Build the random forest.
set.seed(45)
ranger_ctrl <- trainControl(method = "cv", number = 5, search = 'random', allowParallel = TRUE)
cl <- makeCluster(5)
registerDoParallel(cl)
# Random forest using Ranger
set.seed(456)
store_rf <- train(store_rec, train_data,
method = "ranger",
metric = "Rsquared",
tuneLength = 10,
trControl = ranger_ctrl)
## Loading required namespace: e1071
## Loading required namespace: ranger
stopCluster(cl)
store_rf
## Random Forest
##
## 5996 samples
## 7 predictor
##
## Recipe steps: center, scale, dummy
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4795, 4799, 4796, 4798, 4796
## Resampling results across tuning parameters:
##
## min.node.size mtry splitrule RMSE Rsquared MAE
## 3 5 extratrees 150.4929 0.7359519 33.54315
## 6 20 maxstat 135.1206 0.7238664 16.36335
## 8 9 maxstat 168.1805 0.5874670 28.72476
## 9 17 maxstat 142.9711 0.6899804 18.52979
## 11 5 variance 148.4411 0.7148978 29.22154
## 13 20 maxstat 143.1775 0.6859963 18.21757
## 14 2 extratrees 199.0185 0.4560838 51.04841
## 14 15 extratrees 113.7400 0.8335718 12.55911
## 15 6 variance 141.2858 0.7351351 24.87669
## 19 8 extratrees 141.9624 0.7354749 24.02120
##
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 15, splitrule = extratrees
## and min.node.size = 14.
Using the test data to predict and evaluate the model.
bst_pred <- predict(store_tree, test_data)
postResample(bst_pred, obs = test_data$profit)
## RMSE Rsquared MAE
## 95.5340116 0.8184612 24.9234222
The Boosted tree performed well producing a \(R**2\) of 82% on the test data.
rf_pred <- predict(store_rf, test_data)
postResample(rf_pred, obs = test_data$profit)
## RMSE Rsquared MAE
## 93.5938946 0.8356959 9.8806871
The Random forest performed better than the other model producing a \(R**2\) of 84% on the test data.
Both models can be used to predict profits using the features listed.