Introduction

This project was done alongside the Tableau assignment in order to practice some R programming with exploratory data analysis. The data set used is the full superstore data containing information on there sales, profit and other metrics of their stores around the United States of America. We want to find out if the discount policy employed is beneficial or detrimental to the sales of items.

Import pacakges for the assignment

Libraries used for our little project. Although as we move the project will add more libraries.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.1

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.1.2

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggeasy)

## Warning: package 'ggeasy' was built under R version 4.1.3

Extract, transform and load

Read the excel file containing the data that we are going to explore then move to tableau and create a story.

df <- readxl::read_xlsx("Full-Sales-Superstore-Dataset.xlsx")
head(df)

Look at the data more.

str(df)

## tibble [9,994 x 21] (S3: tbl_df/tbl/data.frame)
##  $ Category         : chr [1:9994] "Office Supplies" "Office Supplies" "Office Supplies" "Office Supplies" ...
##  $ City             : chr [1:9994] "Houston" "Naperville" "Naperville" "Naperville" ...
##  $ Country          : chr [1:9994] "United States" "United States" "United States" "United States" ...
##  $ Customer Name    : chr [1:9994] "Darren Powers" "Phillina Ober" "Phillina Ober" "Phillina Ober" ...
##  $ Manufacturer     : chr [1:9994] "Message Book" "GBC" "Avery" "SAFCO" ...
##  $ Order Date       : POSIXct[1:9994], format: "2011-01-04" "2011-01-05" ...
##  $ Order ID         : chr [1:9994] "CA-2011-103800" "CA-2011-112326" "CA-2011-112326" "CA-2011-112326" ...
##  $ Postal Code      : num [1:9994] 77095 60540 60540 60540 19143 ...
##  $ Product Name     : chr [1:9994] "Message Book, Wirebound, Four 5 1/2\" X 4\" Forms/Pg., 200 Dupl. Sets/Book" "GBC Standard Plastic Binding Systems Combs" "Avery 508" "SAFCO Boltless Steel Shelving" ...
##  $ Region           : chr [1:9994] "Central" "Central" "Central" "Central" ...
##  $ Segment          : chr [1:9994] "Consumer" "Home Office" "Home Office" "Home Office" ...
##  $ Ship Date        : POSIXct[1:9994], format: "2011-01-08" "2011-01-09" ...
##  $ Ship Mode        : chr [1:9994] "Standard Class" "Standard Class" "Standard Class" "Standard Class" ...
##  $ State            : chr [1:9994] "Texas" "Illinois" "Illinois" "Illinois" ...
##  $ Sub-Category     : chr [1:9994] "Paper" "Binders" "Labels" "Storage" ...
##  $ Discount         : num [1:9994] 0.2 0.8 0.2 0.2 0.2 0 0 0 0 0 ...
##  $ Number of Records: num [1:9994] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Profit           : num [1:9994] 6 -5 4 -65 5 5 9 746 1 274 ...
##  $ Profit Ratio     : num [1:9994] 0.34 -1.55 0.36 -0.24 0.25 0.41 0.48 0.29 0.27 0.45 ...
##  $ Quantity         : num [1:9994] 2 2 3 3 3 3 3 9 2 2 ...
##  $ Sales            : num [1:9994] 16 4 12 273 20 ...

Generate a summary of the data.

summary(df)

##    Category             City             Country          Customer Name     
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Manufacturer         Order Date                    Order ID        
##  Length:9994        Min.   :2011-01-04 00:00:00   Length:9994       
##  Class :character   1st Qu.:2012-05-23 00:00:00   Class :character  
##  Mode  :character   Median :2013-06-27 00:00:00   Mode  :character  
##                     Mean   :2013-04-30 19:20:02                     
##                     3rd Qu.:2014-05-15 00:00:00                     
##                     Max.   :2014-12-31 00:00:00                     
##   Postal Code    Product Name          Region            Segment         
##  Min.   : 1040   Length:9994        Length:9994        Length:9994       
##  1st Qu.:23223   Class :character   Class :character   Class :character  
##  Median :56431   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :55190                                                           
##  3rd Qu.:90008                                                           
##  Max.   :99301                                                           
##    Ship Date                    Ship Mode            State          
##  Min.   :2011-01-08 00:00:00   Length:9994        Length:9994       
##  1st Qu.:2012-05-27 00:00:00   Class :character   Class :character  
##  Median :2013-06-30 00:00:00   Mode  :character   Mode  :character  
##  Mean   :2013-05-04 18:20:49                                        
##  3rd Qu.:2014-05-19 00:00:00                                        
##  Max.   :2015-01-06 00:00:00                                        
##  Sub-Category          Discount      Number of Records     Profit        
##  Length:9994        Min.   :0.0000   Min.   :1         Min.   :-6600.00  
##  Class :character   1st Qu.:0.0000   1st Qu.:1         1st Qu.:    2.00  
##  Mode  :character   Median :0.2000   Median :1         Median :    9.00  
##                     Mean   :0.1562   Mean   :1         Mean   :   28.65  
##                     3rd Qu.:0.2000   3rd Qu.:1         3rd Qu.:   29.00  
##                     Max.   :0.8000   Max.   :1         Max.   : 8400.00  
##   Profit Ratio        Quantity         Sales        
##  Min.   :-2.7500   Min.   : 1.00   Min.   :    0.0  
##  1st Qu.: 0.0700   1st Qu.: 2.00   1st Qu.:   17.0  
##  Median : 0.2700   Median : 3.00   Median :   54.5  
##  Mean   : 0.1205   Mean   : 3.79   Mean   :  229.9  
##  3rd Qu.: 0.3600   3rd Qu.: 5.00   3rd Qu.:  210.0  
##  Max.   : 0.5000   Max.   :14.00   Max.   :22638.0

After looking at the summary of the data we see that the data has no missing values plus the data type seems to be correct. The only problem may be that the data has mixed case (uppercase or lowercase) and the column headers might need to also be change to make it easier to deal with. Below we shall change the column header to make calling the header columns easier later on.

names(df)

##  [1] "Category"          "City"              "Country"          
##  [4] "Customer Name"     "Manufacturer"      "Order Date"       
##  [7] "Order ID"          "Postal Code"       "Product Name"     
## [10] "Region"            "Segment"           "Ship Date"        
## [13] "Ship Mode"         "State"             "Sub-Category"     
## [16] "Discount"          "Number of Records" "Profit"           
## [19] "Profit Ratio"      "Quantity"          "Sales"

Now we implement the changes mentioned above and print the results below the code cell.

names(df) <- tolower(names(df))
names(df) <- gsub(" ","_", names(df),)
names(df) <- sub("-","_", names(df),)
names(df)

##  [1] "category"          "city"              "country"          
##  [4] "customer_name"     "manufacturer"      "order_date"       
##  [7] "order_id"          "postal_code"       "product_name"     
## [10] "region"            "segment"           "ship_date"        
## [13] "ship_mode"         "state"             "sub_category"     
## [16] "discount"          "number_of_records" "profit"           
## [19] "profit_ratio"      "quantity"          "sales"

Explore the first 5 rows of the superstore dataset.

head(df)

Exploratory analysis

The questions we are going to address is how the discount affect the sales of sub-category items and the profit ratio.

Create a dataframe of the sub-category item with sales, discount, profit and profit_ratio. View the first 5 rows of the new dataframe.

data_sub <- df %>% select(sub_category, sales, discount, profit, profit_ratio)
head(data_sub)

Look at the total sum of sales for each sub-category plus the profit ratio.

data_sub %>% group_by(sub_category) %>% summarise(Total_sales = sum(sales), Total_profit_ratio = sum(profit_ratio),Total_discount = sum(discount), Total_profit = sum(profit)) %>% arrange(desc(Total_sales))

Will have to change some of the data types to factors so as to simplify the visualization done below.

data_sub$sub_category <- as.factor(data_sub$sub_category)

Visualize the total sales from largest to smallest.

data_sub %>% group_by(sub_category) %>% summarise(Total_Sales = sum(sales)) %>% ggplot(aes(x = Total_Sales, y = reorder(sub_category, Total_Sales ), fill = Total_Sales)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of sales for each sub-category',y = 'Sales', title = 'Bar Chart of Sub-category') + theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15), axis.text.x = element_blank()) + geom_text(aes(label = prettyNum(Total_Sales, big.mark = ",")), size = 3, hjust = -.1, vjust = .1) + scale_x_continuous(limits = c(0,400000)) + scale_fill_gradient(low = "light blue", high = "blue")

The bar chart above shows that Phones are making the highest sales and Fasteners are in last place.

Next we look at the data in terms of profit.

data_sub %>% group_by(sub_category) %>% summarise(Total_Profit = sum(profit)) %>% ggplot(aes(x = Total_Profit, y = reorder(sub_category, Total_Profit ), fill = Total_Profit)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of profits for each sub-category',y = 'Profits', title = 'Bar Chart of Sub-category and Profits')+ theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15)) + scale_fill_gradient(low = "red" ,high = "green")

From the above, in terms of profit, copiers generate the most profit and tables actually lose money. We shall examine the data from the profit ratio \(sum(profit)/sum(sales)\).

data_sub %>% group_by(sub_category) %>% summarise(Total_Profit_Ratio = sum(profit_ratio)) %>% ggplot(aes(x = Total_Profit_Ratio, y = reorder(sub_category, Total_Profit_Ratio ), fill = Total_Profit_Ratio)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of Profit ratio for each sub-category',y = 'Profit ratio', title = 'Bar Chart of Sub-category and Profit ratio') + theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15)) + scale_fill_gradient(low = "red", high = "green")

The above shows the profit ratio of the various sub-category items. Paper leads with the highest profit ratio and blinders have the lowest profit ratio.

From the above, we shall make the sales bar graph with the profit ratio to show how the sub-category is doing in terms of profit.

data_sub %>% group_by(sub_category) %>% summarise(Total_Sales = sum(sales), Total_Profit_Ratio = sum(profit_ratio)) %>% ggplot(aes(x = Total_Sales, y = reorder(sub_category, Total_Sales ), fill = Total_Profit_Ratio)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of sales for each sub-category with the profit ratio',y = 'Sales', title = 'Bar Chart of Sub-category') + theme_classic() + theme(plot.title = element_text(size = 15), axis.text.x = element_blank()) + geom_text(aes(label = prettyNum(Total_Sales, big.mark = ",")), size = 3, hjust = -.1, vjust = .1) + scale_x_continuous(limits = c(0,400000)) + scale_fill_gradient(low = "light blue", high = "blue") + easy_add_legend_title("Profit ratio")

From the above, paper which has a high profit ratio in terms of sales ranks low compared to binders that has a low profit ratio.

Let see how much discount binders has compared to the other sub-categories. We shall use a pie chart to illustrate this.

 data_sub %>% group_by(sub_category) %>% summarise(Percent_discount = (sum(discount)/sum(data_sub$discount)*100)) %>% ggplot(aes(x = "", y = Percent_discount, fill = sub_category)) + geom_bar(position = "fill", stat="identity", color = "white") + coord_polar("y", start=0) + labs(title = "Pie chart of the Discount") + theme_void() + easy_add_legend_title("Sub category")

The pie chart clearly shows the discount of binders is greater than the rest. This is the reason that the profit ratio is the lowest from the bar chart.

After this we shall use a correlation matrix and scatter plot to identify any correlation that may exist between discount and the other attributes that we are experimenting on. An important note to mention is that correlation does not imply causation.

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.1.3

## corrplot 0.92 loaded

num_data <- data_sub %>% select(-sub_category)
m <- cor(num_data)
corrplot(m, method = 'number', col = COL2('RdYlBu'), title = "Correlation matrix")

From the corrplot, we can see that there is a strong negative correlation between Profit ratio and discount (-0.86). We also see that there is a weak strong correlation between profit and sales (0.48) and a weak negative correlation between profit and discount.

Now we shall construct a scatterplot of sum of the profit ratio and discount to get a more indepth insight.

data_sub %>% group_by(sub_category) %>% summarize(Total_Profit_Ratio = sum(profit_ratio), Total_discount = sum(discount), Total_sales = sum(sales)) %>% ggplot() + geom_point(aes(y = Total_Profit_Ratio, x = Total_discount, color = sub_category, size = Total_sales)) + theme_classic() + labs(title = "Scatterplot of Discount and Profit ratio", y = "Profit ratio", x = "Discount", subtitle = "Comparing Discount and Profit ratio of different sub-category sales", color = "black") + easy_add_legend_title("Sub-category") + scale_size(guide = "none")

From the scatterplot, we can see that for binders the discount is high therefore the profit ratio is negative. The company may need to re-evaluate the discount for binders.

Conclusion

Blinders seem to be doing good for sales but are highly discounted. The store may require to remove the discount on blinder to generate more profit.

Machine learning

We want to use random tree regression to predict our profit and also gain insight on what features are important for the profit attribute.

From the whole dataset, I will remove meaningless attributes such as customer name that will not be used by the algorithm.

regress <- df %>% select(-customer_name, -country, -postal_code, -city, -product_name, -number_of_records, -order_id, -segment, -ship_mode)
head(regress)

Split the data into test and train datasets.

library(caret)

## Warning: package 'caret' was built under R version 4.1.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.1.1

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(25)
partition <- caret::createDataPartition(regress$profit, p = 0.75, list = FALSE)
data_train <- regress[partition,]
data_test <- regress[-partition,]

After splitting the data, we use the random forest algorithm to run a regression model on the data.

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.1.3

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(38)
rf.fit <- randomForest(profit ~ ., data = data_train, importance = TRUE)
rf.fit

## 
## Call:
##  randomForest(formula = profit ~ ., data = data_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 12659.06
##                     % Var explained: 78.33

Display a graph of mean squared error based on the number of trees used.

plot(rf.fit)

Now we attempt to evaluate the random forest algorithm. First we find the smallest mean squared error produced by the algorithm.

which.min(rf.fit$mse)

## [1] 169

sqrt(rf.fit$mse[which.min(rf.fit$mse)])

## [1] 108.3401

Construct a plot showing the importance of the predictor variable.

varImpPlot(rf.fit)

Show the figures used to construct the figure above.