This project was done alongside the Tableau assignment in order to practice some R programming with exploratory data analysis. The data set used is the full superstore data containing information on there sales, profit and other metrics of their stores around the United States of America. We want to find out if the discount policy employed is beneficial or detrimental to the sales of items.
Libraries used for our little project. Although as we move the project will add more libraries.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.1
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.2
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggeasy)
## Warning: package 'ggeasy' was built under R version 4.1.3
Read the excel file containing the data that we are going to explore then move to tableau and create a story.
df <- readxl::read_xlsx("Full-Sales-Superstore-Dataset.xlsx")
head(df)
Look at the data more.
str(df)
## tibble [9,994 x 21] (S3: tbl_df/tbl/data.frame)
## $ Category : chr [1:9994] "Office Supplies" "Office Supplies" "Office Supplies" "Office Supplies" ...
## $ City : chr [1:9994] "Houston" "Naperville" "Naperville" "Naperville" ...
## $ Country : chr [1:9994] "United States" "United States" "United States" "United States" ...
## $ Customer Name : chr [1:9994] "Darren Powers" "Phillina Ober" "Phillina Ober" "Phillina Ober" ...
## $ Manufacturer : chr [1:9994] "Message Book" "GBC" "Avery" "SAFCO" ...
## $ Order Date : POSIXct[1:9994], format: "2011-01-04" "2011-01-05" ...
## $ Order ID : chr [1:9994] "CA-2011-103800" "CA-2011-112326" "CA-2011-112326" "CA-2011-112326" ...
## $ Postal Code : num [1:9994] 77095 60540 60540 60540 19143 ...
## $ Product Name : chr [1:9994] "Message Book, Wirebound, Four 5 1/2\" X 4\" Forms/Pg., 200 Dupl. Sets/Book" "GBC Standard Plastic Binding Systems Combs" "Avery 508" "SAFCO Boltless Steel Shelving" ...
## $ Region : chr [1:9994] "Central" "Central" "Central" "Central" ...
## $ Segment : chr [1:9994] "Consumer" "Home Office" "Home Office" "Home Office" ...
## $ Ship Date : POSIXct[1:9994], format: "2011-01-08" "2011-01-09" ...
## $ Ship Mode : chr [1:9994] "Standard Class" "Standard Class" "Standard Class" "Standard Class" ...
## $ State : chr [1:9994] "Texas" "Illinois" "Illinois" "Illinois" ...
## $ Sub-Category : chr [1:9994] "Paper" "Binders" "Labels" "Storage" ...
## $ Discount : num [1:9994] 0.2 0.8 0.2 0.2 0.2 0 0 0 0 0 ...
## $ Number of Records: num [1:9994] 1 1 1 1 1 1 1 1 1 1 ...
## $ Profit : num [1:9994] 6 -5 4 -65 5 5 9 746 1 274 ...
## $ Profit Ratio : num [1:9994] 0.34 -1.55 0.36 -0.24 0.25 0.41 0.48 0.29 0.27 0.45 ...
## $ Quantity : num [1:9994] 2 2 3 3 3 3 3 9 2 2 ...
## $ Sales : num [1:9994] 16 4 12 273 20 ...
Generate a summary of the data.
summary(df)
## Category City Country Customer Name
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Manufacturer Order Date Order ID
## Length:9994 Min. :2011-01-04 00:00:00 Length:9994
## Class :character 1st Qu.:2012-05-23 00:00:00 Class :character
## Mode :character Median :2013-06-27 00:00:00 Mode :character
## Mean :2013-04-30 19:20:02
## 3rd Qu.:2014-05-15 00:00:00
## Max. :2014-12-31 00:00:00
## Postal Code Product Name Region Segment
## Min. : 1040 Length:9994 Length:9994 Length:9994
## 1st Qu.:23223 Class :character Class :character Class :character
## Median :56431 Mode :character Mode :character Mode :character
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Ship Date Ship Mode State
## Min. :2011-01-08 00:00:00 Length:9994 Length:9994
## 1st Qu.:2012-05-27 00:00:00 Class :character Class :character
## Median :2013-06-30 00:00:00 Mode :character Mode :character
## Mean :2013-05-04 18:20:49
## 3rd Qu.:2014-05-19 00:00:00
## Max. :2015-01-06 00:00:00
## Sub-Category Discount Number of Records Profit
## Length:9994 Min. :0.0000 Min. :1 Min. :-6600.00
## Class :character 1st Qu.:0.0000 1st Qu.:1 1st Qu.: 2.00
## Mode :character Median :0.2000 Median :1 Median : 9.00
## Mean :0.1562 Mean :1 Mean : 28.65
## 3rd Qu.:0.2000 3rd Qu.:1 3rd Qu.: 29.00
## Max. :0.8000 Max. :1 Max. : 8400.00
## Profit Ratio Quantity Sales
## Min. :-2.7500 Min. : 1.00 Min. : 0.0
## 1st Qu.: 0.0700 1st Qu.: 2.00 1st Qu.: 17.0
## Median : 0.2700 Median : 3.00 Median : 54.5
## Mean : 0.1205 Mean : 3.79 Mean : 229.9
## 3rd Qu.: 0.3600 3rd Qu.: 5.00 3rd Qu.: 210.0
## Max. : 0.5000 Max. :14.00 Max. :22638.0
After looking at the summary of the data we see that the data has no missing values plus the data type seems to be correct. The only problem may be that the data has mixed case (uppercase or lowercase) and the column headers might need to also be change to make it easier to deal with. Below we shall change the column header to make calling the header columns easier later on.
names(df)
## [1] "Category" "City" "Country"
## [4] "Customer Name" "Manufacturer" "Order Date"
## [7] "Order ID" "Postal Code" "Product Name"
## [10] "Region" "Segment" "Ship Date"
## [13] "Ship Mode" "State" "Sub-Category"
## [16] "Discount" "Number of Records" "Profit"
## [19] "Profit Ratio" "Quantity" "Sales"
Now we implement the changes mentioned above and print the results below the code cell.
names(df) <- tolower(names(df))
names(df) <- gsub(" ","_", names(df),)
names(df) <- sub("-","_", names(df),)
names(df)
## [1] "category" "city" "country"
## [4] "customer_name" "manufacturer" "order_date"
## [7] "order_id" "postal_code" "product_name"
## [10] "region" "segment" "ship_date"
## [13] "ship_mode" "state" "sub_category"
## [16] "discount" "number_of_records" "profit"
## [19] "profit_ratio" "quantity" "sales"
Explore the first 5 rows of the superstore dataset.
head(df)
The questions we are going to address is how the discount affect the sales of sub-category items and the profit ratio.
Create a dataframe of the sub-category item with sales, discount, profit and profit_ratio. View the first 5 rows of the new dataframe.
data_sub <- df %>% select(sub_category, sales, discount, profit, profit_ratio)
head(data_sub)
Look at the total sum of sales for each sub-category plus the profit ratio.
data_sub %>% group_by(sub_category) %>% summarise(Total_sales = sum(sales), Total_profit_ratio = sum(profit_ratio),Total_discount = sum(discount), Total_profit = sum(profit)) %>% arrange(desc(Total_sales))
Will have to change some of the data types to factors so as to simplify the visualization done below.
data_sub$sub_category <- as.factor(data_sub$sub_category)
Visualize the total sales from largest to smallest.
data_sub %>% group_by(sub_category) %>% summarise(Total_Sales = sum(sales)) %>% ggplot(aes(x = Total_Sales, y = reorder(sub_category, Total_Sales ), fill = Total_Sales)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of sales for each sub-category',y = 'Sales', title = 'Bar Chart of Sub-category') + theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15), axis.text.x = element_blank()) + geom_text(aes(label = prettyNum(Total_Sales, big.mark = ",")), size = 3, hjust = -.1, vjust = .1) + scale_x_continuous(limits = c(0,400000)) + scale_fill_gradient(low = "light blue", high = "blue")
The bar chart above shows that Phones are making the highest sales and Fasteners are in last place.
Next we look at the data in terms of profit.
data_sub %>% group_by(sub_category) %>% summarise(Total_Profit = sum(profit)) %>% ggplot(aes(x = Total_Profit, y = reorder(sub_category, Total_Profit ), fill = Total_Profit)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of profits for each sub-category',y = 'Profits', title = 'Bar Chart of Sub-category and Profits')+ theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15)) + scale_fill_gradient(low = "red" ,high = "green")
From the above, in terms of profit, copiers generate the most profit and tables actually lose money. We shall examine the data from the profit ratio \(sum(profit)/sum(sales)\).
data_sub %>% group_by(sub_category) %>% summarise(Total_Profit_Ratio = sum(profit_ratio)) %>% ggplot(aes(x = Total_Profit_Ratio, y = reorder(sub_category, Total_Profit_Ratio ), fill = Total_Profit_Ratio)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of Profit ratio for each sub-category',y = 'Profit ratio', title = 'Bar Chart of Sub-category and Profit ratio') + theme_classic() + theme(legend.position = 'none', plot.title = element_text(size = 15)) + scale_fill_gradient(low = "red", high = "green")
The above shows the profit ratio of the various sub-category items. Paper leads with the highest profit ratio and blinders have the lowest profit ratio.
From the above, we shall make the sales bar graph with the profit ratio to show how the sub-category is doing in terms of profit.
data_sub %>% group_by(sub_category) %>% summarise(Total_Sales = sum(sales), Total_Profit_Ratio = sum(profit_ratio)) %>% ggplot(aes(x = Total_Sales, y = reorder(sub_category, Total_Sales ), fill = Total_Profit_Ratio)) + geom_col() + labs(x = 'Sub-Category', subtitle = 'The sum of sales for each sub-category with the profit ratio',y = 'Sales', title = 'Bar Chart of Sub-category') + theme_classic() + theme(plot.title = element_text(size = 15), axis.text.x = element_blank()) + geom_text(aes(label = prettyNum(Total_Sales, big.mark = ",")), size = 3, hjust = -.1, vjust = .1) + scale_x_continuous(limits = c(0,400000)) + scale_fill_gradient(low = "light blue", high = "blue") + easy_add_legend_title("Profit ratio")
From the above, paper which has a high profit ratio in terms of sales ranks low compared to binders that has a low profit ratio.
Let see how much discount binders has compared to the other sub-categories. We shall use a pie chart to illustrate this.
data_sub %>% group_by(sub_category) %>% summarise(Percent_discount = (sum(discount)/sum(data_sub$discount)*100)) %>% ggplot(aes(x = "", y = Percent_discount, fill = sub_category)) + geom_bar(position = "fill", stat="identity", color = "white") + coord_polar("y", start=0) + labs(title = "Pie chart of the Discount") + theme_void() + easy_add_legend_title("Sub category")
The pie chart clearly shows the discount of binders is greater than the rest. This is the reason that the profit ratio is the lowest from the bar chart.
After this we shall use a correlation matrix and scatter plot to identify any correlation that may exist between discount and the other attributes that we are experimenting on. An important note to mention is that correlation does not imply causation.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
num_data <- data_sub %>% select(-sub_category)
m <- cor(num_data)
corrplot(m, method = 'number', col = COL2('RdYlBu'), title = "Correlation matrix")
From the corrplot, we can see that there is a strong negative correlation between Profit ratio and discount (-0.86). We also see that there is a weak strong correlation between profit and sales (0.48) and a weak negative correlation between profit and discount.
Now we shall construct a scatterplot of sum of the profit ratio and discount to get a more indepth insight.
data_sub %>% group_by(sub_category) %>% summarize(Total_Profit_Ratio = sum(profit_ratio), Total_discount = sum(discount), Total_sales = sum(sales)) %>% ggplot() + geom_point(aes(y = Total_Profit_Ratio, x = Total_discount, color = sub_category, size = Total_sales)) + theme_classic() + labs(title = "Scatterplot of Discount and Profit ratio", y = "Profit ratio", x = "Discount", subtitle = "Comparing Discount and Profit ratio of different sub-category sales", color = "black") + easy_add_legend_title("Sub-category") + scale_size(guide = "none")
From the scatterplot, we can see that for binders the discount is high therefore the profit ratio is negative. The company may need to re-evaluate the discount for binders.
Blinders seem to be doing good for sales but are highly discounted. The store may require to remove the discount on blinder to generate more profit.
We want to use random tree regression to predict our profit and also gain insight on what features are important for the profit attribute.
From the whole dataset, I will remove meaningless attributes such as customer name that will not be used by the algorithm.
regress <- df %>% select(-customer_name, -country, -postal_code, -city, -product_name, -number_of_records, -order_id, -segment, -ship_mode)
head(regress)
Split the data into test and train datasets.
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.1.1
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(25)
partition <- caret::createDataPartition(regress$profit, p = 0.75, list = FALSE)
data_train <- regress[partition,]
data_test <- regress[-partition,]
After splitting the data, we use the random forest algorithm to run a regression model on the data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(38)
rf.fit <- randomForest(profit ~ ., data = data_train, importance = TRUE)
rf.fit
##
## Call:
## randomForest(formula = profit ~ ., data = data_train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 12659.06
## % Var explained: 78.33
Display a graph of mean squared error based on the number of trees used.
plot(rf.fit)
Now we attempt to evaluate the random forest algorithm. First we find the smallest mean squared error produced by the algorithm.
which.min(rf.fit$mse)
## [1] 169
sqrt(rf.fit$mse[which.min(rf.fit$mse)])
## [1] 108.3401
Construct a plot showing the importance of the predictor variable.
varImpPlot(rf.fit)
Show the figures used to construct the figure above.
rfvar <- varImp(rf.fit, scale = TRUE) %>% arrange(desc(Overall))
rfvar