Name: Karolina Grodzinska
Class: ALY6000 Introduction to Data Analytics
INTRODUCTION
This project is based on the analysis of sales data. It has been said that “analytics is delivering significant improvements in growth, efficiency, and effectiveness” (Atkins et al., 2016), and analytics has been playing an increasingly important role in sales.
There are four primary areas where sales analytics is deployed (Atkins et al., 2016). Companies use historical data to get a better overview of their current customers and to identify prospective customers with higher accuracy. Big data allows for finer segmentation of clients and for building recommendation systems, and there are models that can predict churn. Other uses of data include resource allocation, sales planning, and shipment tracking. In addition, sales companies use voice recognition and natural language processing for customer support (Atkins et al., 2016).
Thus, the use of analytics can prevent losses: with real-time access to the necessary information, a company can predict shortages of components or offer special promotions when certain products are not selling.
The dataset used in this project consists of 27 variables and 1,000 observations, both qualitative and quantitative. Ten of the variables are numerical business metrics - sales volumes, losses, and profits. Other numerical variables describe the products: their prices, quantities sold, quantities of returned products, shipping prices, discounts, etc.
Many of the remaining variables are categorical: they contain information about the customers (their names and unique IDs) and geographical data about where the customers are located, as well as details of products and orders. Two further variables represent the dates of orders and shipments. Selected variables will be used in the analysis section to draw insightful information about the company this data was collected from.
PROBLEM FORMULATION
A graphical representation of data can be a powerful tool for making sense of it, and charts are said to “serve as a shortcut for the human mind to convert data into knowledge” (Jones, 2020). Therefore, many tasks will focus on presenting the data visually.
There are many types of charts, and different types of data are represented better by certain chart types; choosing the wrong visualization can lead to a wrong interpretation of the data (Oetting, n.d.). This project will mostly use bar charts and pie charts for discrete categorical data (counting the numbers of orders per category), and box plots and histograms for numerical variables. I also wanted to try other plotting methods, so dot plots will be used in Task 4 to present statistics for chosen subcategories.
This analysis takes a top-down view of the sales data, as a manager would see it. The variables used will therefore relate to financial metrics - profits, losses, total sales - rather than drilling down into individual customers, their orders, discounts, etc. Using exploratory data analysis, descriptive statistics, and graphs, I will draw insights about the most profitable segments (both customer segments and groups of products). I will also explore the best months and days of the week for this business.
ANALYSIS SECTION
Task 1
Presenting basic descriptive statistics of 5 numerical variables using a table.
# Creating the variables
sales_mean = round(mean(df$Sales_Total),2)
sales_median = round(median(df$Sales_Total),2)
sales_range = round((max(df$Sales_Total) - min(df$Sales_Total)),2)
sales_sd = round(sd(df$Sales_Total),2)
net_mean = round(mean(df$Net_Sale),2)
net_median = round(median(df$Net_Sale),2)
net_range = round((max(df$Net_Sale) - min(df$Net_Sale)),2)
net_sd = round(sd(df$Net_Sale),2)
loss_mean = round(mean(df$LossPerReturn),2)
loss_median = round(median(df$LossPerReturn),2)
loss_range = round((max(df$LossPerReturn) - min(df$LossPerReturn)),2)
loss_sd = round(sd(df$LossPerReturn),2)
total_loss_mean = round(mean(df$Total_loss),2)
total_loss_median = round(median(df$Total_loss),2)
total_loss_range = round((max(df$Total_loss) - min(df$Total_loss)),2)
total_loss_sd = round(sd(df$Total_loss),2)
profit_mean = round(mean(df$Profits),2)
profit_median = round(median(df$Profits),2)
profit_range = round((max(df$Profits) - min(df$Profits)),2)
profit_sd = round(sd(df$Profits),2)
# Creating the table
Variable = c("Sales Total", "Net Sales", "Loss per return", "Total Loss", "Profits")
Mean = c(sales_mean, net_mean, loss_mean, total_loss_mean, profit_mean)
sd = c(sales_sd, net_sd, loss_sd, total_loss_sd, profit_sd)
Range = c(sales_range, net_range, loss_range, total_loss_range, profit_range)
Median = c(sales_median, net_median, loss_median, total_loss_median, profit_median)
stats = data.frame(Variable, Mean, sd, Range, Median,
stringsAsFactors=FALSE)
# Showing the table
knitr::kable(stats)
| Variable | Mean | sd | Range | Median |
|---|---|---|---|---|
| Sales Total | 3107.40 | 2574.06 | 28934.83 | 2481.89 |
| Net Sales | 2283.54 | 1691.58 | 17448.71 | 1915.69 |
| Loss per return | 592.28 | 317.46 | 3678.78 | 540.90 |
| Total Loss | 823.86 | 1007.93 | 11527.92 | 578.32 |
| Profits | 1066.54 | 901.22 | 8799.68 | 834.35 |
This table shows the basic descriptive statistics for 5 numerical variables related to sales. The mean and the median are the most commonly used measures of central tendency. The mean (also called the average) is calculated by summing all values and dividing the sum by the total number of observations. The median is the middle value (or the average of the two middle values) once all values are ordered. In a symmetric distribution such as the normal, the mean, median, and mode coincide; when they differ, the distribution is positively or negatively skewed (Bluman, 2018). The distributions of the chosen variables are shown as histograms in the next section.
In statistics, it is not enough to look only at measures of central tendency to describe a dataset accurately. The table therefore also includes measures of variation: the range (the difference between the maximum and minimum values) and the standard deviation (Bluman, 2018).
It is interesting that loss per return has such a large range, meaning that some customers returned big orders. It would definitely be worth looking into that and finding more data within the firm to see what the cause was and whether returns of that size could be prevented.
Task 2
Creating box plots and histograms of numerical variables
While describing data, it is important to look at the shape of its distribution. A good way of showing distribution in a graphical way is drawing a histogram, which shows frequencies of data points and organizes them in the number of bins specified by the analyst (Bluman, 2018). In the task below, 4 numerical variables will be shown as histograms with different numbers of bins (chosen based on the best representation of data). Another graph used to summarize each of the variables will be a boxplot.
Box plots represent the lowest and highest values, the first and third quartile, and the median. The name comes from the box that is drawn between the first and third quartile on the chart (Bluman, 2018).
library(RColorBrewer) # brewer.pal() used below comes from this package
par(mfrow=c(2,1)) # showing one graph under the other
par(mar=c(2,2,2,2)) # changes margins within the figure
boxplot(df$Profits,
horizontal = TRUE,
col = "red3")
hist(df$Profits,
breaks = 20,
col = brewer.pal(11, "Spectral"),
main = "Distribution of profits")
Firstly, let’s take a look at profits. This variable appears positively skewed: the majority of values lie to the left of the mean, clustered in the lower end of the distribution, and there are many outliers more than 1.5 * IQR above the 3rd quartile (Bluman, 2018).
The minimum profit is -$158.13, and profits for most orders do not exceed $2,000.
par(mfrow=c(2,1)) # showing one graph under the other
par(mar=c(2,2,2,2)) # changes margins within the figure
boxplot(df$Sales_Total,
horizontal = TRUE,
col = "gold")
hist(df$Sales_Total,
breaks = 25,
col = terrain.colors(10),
main = "Distribution of the variable Total Sales")
All values for total sales are positive and most do not exceed $5,000. There are many outliers and, like figure 1, figure 2 shows a positively skewed distribution. As already known from table 1, the median total sale is 2,481.89. From the box plot it appears that the data points are more spread out within the 3rd quartile.
par(mfrow=c(2,1)) # showing one graph under the other
par(mar=c(2,2,2,2)) # changes margins within the figure
boxplot(df$Total_loss,
horizontal = TRUE,
col = "aquamarine3")
hist(df$Total_loss,
breaks = 20,
col = brewer.pal(9, "YlGnBu"),
main = "Distribution of the variable Total Losses")
q75 = quantile(df$Total_loss, c(.75))
The values for total losses range between $0 and $11,527.92. However, most values are below $1,103.92 (the 75th percentile computed above), and in the case of total losses the data points are slightly more spread out within the first quartile than within the third.
This shape of the distribution is called “Reverse J-shaped”, meaning that there are very few values on the right side of the plot and most values are within the first bin (Bluman, 2018).
par(mfrow=c(2,1)) # showing one graph under the other
par(mar=c(2,2,2,2)) # changes margins within the figure
boxplot(df$LossPerReturn,
horizontal = TRUE,
col = "blueviolet")
hist(df$LossPerReturn,
breaks = 15,
col = brewer.pal(9, "BuPu"),
main = "Distribution of losses per return")
The histogram in figure 4, just like figures 1 and 2, shows a right-skewed distribution. The median loss per return is $540.90 and the highest value is $3,842.64; looking at the dataset, it corresponds to an order of 11 Samsung phones. Unfortunately, this dataset does not provide reasons for returns.
Task 3
I used the UNIQUE() function in Excel to identify categorical variables with a smaller number of categories, so that my visualizations for this task would not be too cluttered.
# Creating a bar plot
a = barplot(table(df$Market),
main = "Number of orders across different markets",
ylab = "Frequency",
col = brewer.pal(5, "YlGnBu"))
It seems that most orders come from the Asia Pacific market, while Africa has the smallest number of orders. It would be interesting to see summaries of profits for each of these markets, to understand whether higher numbers of orders correspond to higher profits or whether other factors, such as returns, caused losses. Another possibility would be many orders of low value, which could also affect the profits.
library(dplyr) # the pipes and verbs below assume dplyr is loaded
profits <- df %>%
select(Market, Profits) %>%
group_by(Market) %>%
summarise(Profits = sum(Profits)) # summarise_each()/funs() are deprecated in current dplyr
knitr::kable(profits)
| Market | Profits |
|---|---|
| Africa | 57942.19 |
| Asia Pacific | 352524.78 |
| Europe | 333985.99 |
| LATAM | 216745.95 |
| USCA | 105346.07 |
The profits in table 2 follow the numbers of orders shown in figure 5: markets with more orders also earned more. Thus, there is no need to explore additional market-level trends.
I wanted to explore how fast the analyzed company has been able to ship its products. To do that, I needed to change the data type of the two date variables to a date type recognized by R, and I created a new variable called ‘days’ that calculates the difference between the shipment date and the order date. I used an article on R Bloggers (2013) to learn about date types in R. This transformation allows me to see whether the logistics department has been able to ship products to customers quickly.
# Changing the variables from 'character' type to dates
df$OrderDate = as.Date(df$OrderDate, format = "%m/%d/%y")
df$ShipDate = as.Date(df$ShipDate, format = "%m/%d/%y")
# Creating a new column
df$days = df$ShipDate - df$OrderDate
# Visualizing the data
days = table(df$days)
p = barplot(days,
main = "Number of days it takes to ship a product",
xlab = "Days",
ylab = "Frequency",
col = c("darkslategray1","deepskyblue","dodgerblue2"))
text(y = days,
p,
days,
cex=0.8,
pos = 1)
It turns out that figure 6 does not reveal any interesting insights: all products ship between 2 and 4 days after they are ordered. However, having these variables in date format makes it possible to discover other trends in this dataset.
# Creating additional variables for a weekday and a month
df$month = format(df$OrderDate,"%B")
df$weekday = format(df$OrderDate,"%a")
weekday = factor(df$weekday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# Let's see on which day of the week customers order their products
x = barplot(table(weekday), main = "On which days do customers create most orders?",
col = c(terrain.colors(7)),
xlab = "Day",
ylab = "Number of orders")
text(y = table(weekday),
x,
table(weekday),
cex=0.8,
pos = 1)
To create figure 7, I needed to change the data type of the weekdays to a factor; otherwise the bars would not appear in the order I wanted (they are now shown as in the calendar, from Monday to Sunday). It seems that most orders were created on Tuesdays, which could be valuable information for the marketing department when planning promotions.
The next task was to create 2 pie charts for other categorical variables. I chose to visualize departments and customer segments, as both variables include only 3 categories; pie charts with many slices are very confusing for the reader, as it becomes increasingly difficult to compare the angles (especially when the values are similar).
# table(df$Department)
pie(table(df$Department),
labels = names(table(df$Department)), # names from table() keep labels aligned with slices; unique() may return a different order
col = c("mediumpurple4","darkorchid2","mediumpurple1"),
main = "Number of orders by department")
Any data can be represented using graphical marks (for example lines, dots, bars, or other shapes) and their attributes, such as colors, sizes, positions, etc. However, sometimes chosen attributes fail to express the information shown in the data (Jones, 2020).
Figure 8 definitely does not show the differences between the categories very well, since the corresponding numbers are not that different (305 orders for furniture, 336 for office supplies, and 355 for technology).
It is difficult for the human mind to compare areas. Bar charts, on the other hand, are relatively easy to interpret: we can compare lengths and even estimate how much longer one bar is than another, even without any numerical labels (Jones, 2020).
pie(table(df$Segment),
labels = table(df$Segment),
col = c("paleturquoise1","steelblue1","lightslateblue"),
main = "Number of orders by customer segment")
legend("topleft", legend = names(table(df$Segment)), # names(table()) keeps the legend order aligned with the slice order
fill = c("paleturquoise1","steelblue1","lightslateblue"))
This pie chart (figure 9) definitely looks better than the previous graph, as it shows both the value for each category and the category names in a legend. The differences between segments are also more pronounced: anyone can see that the “Consumer” segment accounts for more than half of the orders.
Task 4
In this task I will use one numerical and one categorical variable for each graph to gain a deeper understanding of sales across categories, presenting the statistics calculated per category as dot plots.
Dot plots are statistical graphs that show each data value as a point above the x axis (Bluman, 2018). Base R’s dotchart() draws the Cleveland variant, which plots one summary value per category.
# Let's see the total sales per department
dept_sales = tapply(df$Sales_Total, df$Department, sum)
dotchart(dept_sales, pch = 24, bg = "blue", cex = 1.3,
xlab="Sum of sales",
main = "What are the total sales \nfor each department?")
As shown in figure 10, the technology department has the highest total sales in the company and office supplies the lowest. This is most likely because the prices of technological goods are much higher than the prices of office supplies, so even with lower sales volumes it is easier to earn more money selling tech.
# Now average losses per customer segment
segm_loss = tapply(df$Total_loss, df$Segment, mean)
dotchart(segm_loss, pch = 21, bg = "red2", cex = 1.3,
xlab="Average losses",
main = "What is the average loss \nfor each customer segment?")
Firstly, it is necessary to note that because the axis does not start at zero, the difference between the segments looks big, while in fact it is only about $100. The “Home Office” customer segment recorded the highest average losses, which means that customers in this category returned more products than customers from other segments.
# The last plot will show a median profit per each day of the week
week_profit = tapply(df$Profits, weekday, median)
dotchart(week_profit, pch = 22, bg = "purple1", cex = 1.3,
xlab="Median value of profits",
main = "What is the median profit \nfor each day of the week?")
Even though most orders were created on Tuesdays (as previously shown in figure 7), the median profit is highest on Saturdays, and Tuesday actually has the second-lowest value. There may be more than one explanation for this, depending on how the data is collected. If profits are registered immediately, at the end of each day, then customers create more orders on Tuesdays but of lower value. But if payment takes a few days, or profit is calculated after the product has been shipped, the high Saturday profits might be caused by orders created a few days earlier, as we know that shipping takes between 2 and 4 days.
Task 5
Filtering data to see the total profits in Europe
As I have lived in Europe for most of my life, I know this region best, and it has the 2nd highest profits, as shown in table 2. Thus, I’d like to focus on Europe in this task.
# Filtering data to create a new table about Europe
Europe = filter(df, Market == "Europe")
EU_profits <- Europe %>%
select(Region, Profits) %>%
group_by(Region) %>%
summarise(Profits = sum(Profits))
knitr::kable(EU_profits)
| Region | Profits |
|---|---|
| Eastern Europe | 29226.86 |
| Northern Europe | 59009.73 |
| Southern Europe | 61970.46 |
| Western Europe | 183778.93 |
It seems that Western Europe recorded the highest profits (3 times higher than the 2nd region) and Eastern Europe the lowest. This is most likely related to how wealthy each region is: salaries in Eastern Europe are much lower than in Northern or Western Europe, so non-business customers in particular have less disposable income and will not spend as much on furniture or gadgets.
# Creating a horizontal bar chart
library(ggplot2) # ggplot() below assumes ggplot2 is loaded
ggplot(EU_profits, aes(x = Region, y = Profits)) +
geom_bar(stat = "identity", fill = "#00abff") +
coord_flip() +
labs(title="Profits across different regions in Europe")
Figure 13 represents the same information as table 3 in a visual way: now it is easier to see how much higher the profits are for Western Europe than for any other region. Southern and Northern Europe have similar profits.
Task 6
Additional insights about the sales data
This section was an open question that could be answered using any variables and any methods we learned during this course.
As the last part of this analysis, I created a table of chosen numerical variables summarized per month, which also needed to be put in chronological order:
# Creating a new dataframe to show summarized values per each month
# (selecting columns by name is more robust than the positional indices used originally)
monthly <- df %>%
select(month, Quantity, Sales_Total, Returns, Total_loss, Profits) %>%
group_by(month) %>%
summarise(across(everything(), sum))
monthly$month = factor(monthly$month, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
monthly = monthly[order(monthly$month, decreasing = FALSE),]
knitr::kable(monthly)
| month | Quantity | Sales_Total | Returns | Total_loss | Profits |
|---|---|---|---|---|---|
| January | 380 | 167525.9 | 62 | 37827.40 | 63372.22 |
| February | 256 | 123401.8 | 48 | 32162.18 | 41509.89 |
| March | 324 | 140635.8 | 54 | 31372.01 | 53446.44 |
| April | 502 | 216535.7 | 104 | 64776.26 | 67336.51 |
| May | 422 | 187970.6 | 73 | 46197.63 | 67350.46 |
| June | 836 | 404259.1 | 171 | 112416.13 | 131115.98 |
| July | 432 | 221443.5 | 78 | 59118.01 | 72777.06 |
| August | 660 | 320076.1 | 134 | 91569.28 | 105717.78 |
| September | 796 | 403586.7 | 157 | 111181.37 | 135411.51 |
| October | 528 | 238270.5 | 99 | 61117.43 | 84205.76 |
| November | 740 | 348829.8 | 136 | 86982.17 | 129045.46 |
| December | 736 | 334867.3 | 135 | 89142.64 | 115255.93 |
I wanted to see whether there are any interesting patterns in the data aggregated on a monthly level.
It seems that June had the highest values for all the variables except profits, where September leads by $4,295.53. Table 4 only shows summarized values, so as the last part of this analysis I would like to see the distribution of losses throughout the year: it seems that the total loss in June was so high that June ended with a lower profit than September despite having the highest volume of sales.
# Changing months to factors and ordering them for the graph
df$month = factor(df$month, levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
# Box plot (Spectral only has 11 colors, so interpolate it to cover 12 months)
ggplot(df, aes(x = month, y = Total_loss, fill = month)) +
geom_boxplot() +
scale_fill_manual(values = colorRampPalette(brewer.pal(11, "Spectral"))(12)) +
labs(title = "Distribution of losses throughout the year", x = "Month", y = "Total loss") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.6))
Interestingly enough, the highest total loss corresponds to an order placed in July, shown as an outlier with a value close to $12,000. After looking at the raw data again, it turned out to be exactly the same order mentioned earlier: a customer ordered 11 Samsung phones for $3,558 each and returned 3 of them.
Some months have higher median losses, such as April, August, and September. The cause of June’s higher total loss compared to September seems to be the outliers: several orders placed in June created large losses. Additionally, table 4 shows that June also had the highest number of returns (171), which most likely drove the highest total loss.
The distribution of total losses varies between the months: in some, the data points are more spread out within the 1st quartile (for example November or July), in others within the 3rd quartile (for instance February, April, and June). February is the only month whose data contains no outliers.
# I'd like to explore average losses per customer segment
segm_loss = tapply(df$Total_loss, df$Department, mean)
dotchart(segm_loss, pch = 24, bg = "purple1", cex = 1.3,
xlab="Average losses",
main = "What is the average loss \nfor each department?")
It turns out that furniture actually records the highest average losses, even though, as could be seen in figure 8, there are not more furniture orders than orders for the other two departments.
CONCLUSIONS
Data analysis is an iterative process. After carefully analyzing this dataset, I came across some additional questions that could improve this analysis and provide additional insights. If I actually worked for this company, I would go back to the different departments to ask for clarifications.
For instance:
- When exactly are profits and losses registered: on the day of the order, the payment, or the shipment?
- What is the return policy, and within how many days can a product be returned?
- What caused the largest returns, such as the order of 11 Samsung phones?
Those are just the questions that affect this analysis directly, as, for example, different ways of registering financial metrics would lead to different conclusions when analyzing a particular day, month, or quarter. If there is a policy that a product can be returned within 30 days, an order could be registered as a profit in one month and then as a loss in the following one. That is why analysts need to ask many questions and use critical thinking to find the reasons why the trends are as they are.
Data analysts often need to go back to previous steps of the process to collect more data, reformulate a problem, or find other methods to solve it. This project helped me come up with some interesting questions that can be answered using data. I also practiced using visualizations to spot patterns (and found out that some visualizations are not very helpful for communicating insights).
If this project were delivered to a manager, I would recommend focusing on losses and returns. We know that furniture has the highest average losses and that the home office customer segment shows the same trend. Perhaps some clients ordered furniture for their home offices and returned it when they no longer needed to work from home? This is definitely worth investigating further in order to avoid losing money on returned products in the future. For the future course of this analysis, I would recommend looking at the numbers of returned items per order, drilling down to see which products are returned most often, and checking which orders caused the highest losses; from there it would be possible to identify either faulty products or particular customers who return their orders most frequently. If there is a pattern, the analyzed company could make more profit by preventing returns.
BIBLIOGRAPHY
Atkins, C., Valdivieso De Uster, M., Mahdavian, M., & Yee, L. (2016). Unlocking the power of data in sales. McKinsey. Retrieved from https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/unlocking-the-power-of-data-in-sales
Bluman, A. G. (2018). Elementary statistics: A step by step approach, a brief version (10th ed.). New York: McGraw-Hill Higher Education.
Jones, B. (2020). Learning to see data: How to interpret the visual language of charts (1st ed.). Data Literacy Press.
Oetting, J. (n.d.). Data visualization 101: How to choose the right chart or graph for your data. HubSpot. Retrieved from https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization
R Bloggers. (2013). Date formats in R. Retrieved from https://www.r-bloggers.com/2013/08/date-formats-in-r/