Exploratory Data Analysis

Dataset

The data I will use is “Retail Dataset”, that contains of sales record included the profit or loss from a retail company for 4 years. Please take a look at the following glossary for your reference:

Order.ID: Id of order.
Order.Date: Date of order.
Ship.Date: Date of shipment.
Ship.Mode: Type of shipment.
Customer.ID: Id of customer.
Segment : Customer’s segment.
Product.ID: Id of product.
Category : Product category with 3 levels (“Furniture”, “Office Supplies”, “Technology”)
Sub.Category: more specific product category
Product.Name: Name of product that was sold.
Sales: How much earning from each sale.
Quantity: Quantity of item sold.
Discount: How much discount was given for each sale.
Profit: How much can a company earn from each sale.

Import Libraries

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.5     v purrr   0.3.4
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## Warning: package 'tibble' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'readr' was built under R version 4.0.5

## Warning: package 'purrr' was built under R version 4.0.5

## Warning: package 'stringr' was built under R version 4.0.5

## Warning: package 'forcats' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

Import Data

retail <- read.csv("retail.csv")
head(retail)

Data Cleansing

explicit coercion

str(retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Change the types of data into the appropriate data types

retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)

str(retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Check Missing Value

anyNA(retail)

## [1] FALSE

No missing values found

Drop the first column because we don’t need that information

retail <- retail[,-1]
head(retail)

Data Explanation

summary(retail)

##    Order.ID           Order.Date           Ship.Date         
##  Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
##  Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
##  Mode  :character   Median :2016-06-26   Median :2016-06-29  
##                     Mean   :2016-04-30   Mean   :2016-05-03  
##                     3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
##                     Max.   :2017-12-30   Max.   :2018-01-05  
##                                                              
##           Ship.Mode    Customer.ID               Segment      Product.ID       
##  First Class   :1538   Length:9994        Consumer   :5191   Length:9994       
##  Same Day      : 543   Class :character   Corporate  :3020   Class :character  
##  Second Class  :1945   Mode  :character   Home Office:1783   Mode  :character  
##  Standard Class:5968                                                           
##                                                                                
##                                                                                
##                                                                                
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
##

Data Summary

This data records 4 years of sales from 2014-2017
The last shipment occurred in January 2018
Standard class was the most preferred type of shipment and same day was the lowest
The most buyers coming from consumer segment and home office segment was the lowest
Office Supply category was the best seller compare to others
In Sub category, Other stuff was the most favorite one then followed by Binders and paper. Art is the lowest
The average sales at 229.858; with max sales value at 22638.480 and minimum sales value at 0.444
Max quantity order was 14 and the minimum was 1
The maximum discount for an item is 80%
The biggest lost for one item is 6599.978 dollars
The maximum profit is 8399.976 dollars

Subsetting

Which customer_id had the biggest sales?

filter(retail, Sales==max(Sales))

Answer : The biggest sales were made by customers with id SM-20320

What were the sub-categories in Office Supplies and how much was the total profit for each

retail %>% filter(Category=='Office Supplies') %>%
    group_by(Category,Sub.Category) %>% summarise(total_profit=sum(Profit)) %>% 
  arrange(desc(total_profit))

## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.

Answer : In office supplies there were nine items available, namely Appliances, Art, Binders, Envelopes, Fasteners, Labels, Paper, Storage, Supplies. From all that items in Office Supplies, Paper had the biggest profit and Supplies was the lowest one.

How many orders resulted in a loss?

nrow(filter(retail, Profit<0))

## [1] 1871

Answer : There are 1871 orders that resulted in a loss

From the customer_id JE-16165, KH-16510 and AD-10180. Which one had the biggest total sales

retail %>% filter(Customer.ID %in% c('JE-16165', 'KH-16510', 'AD-10180')) %>% 
  group_by(Customer.ID) %>% 
  summarise(total_sales=sum(Sales)) %>% 
  arrange(desc(total_sales))

Answer : From that three Customer id, AD-10180 had the biggest total sales which is 6106.880 dollars. Then, the lowest one is JE-16165.

Data Extraction

Make a new Data Frame named yearly_sales that contains total sales, number of orders, and total profit for each year

yearly_sales <- retail %>% mutate(year=year(Order.Date)) %>% 
  group_by(year) %>% 
  summarise(total_sales=sum(Sales),
            total_profit=sum(Profit),
            total_orders=n()) %>% 
  arrange(desc(total_sales))
yearly_sales

The yearly_sales dataset shows that the highest total_sales, total_profits and total_orders occurred in 2017, while the lowest total_sales occurred in 2015 but the lowest total_profits and total_orders occurred in 2014.