The data I will use is “Retail Dataset”, that contains of sales record included the profit or loss from a retail company for 4 years. Please take a look at the following glossary for your reference:
Order.ID: Id of order.Order.Date: Date of order.Ship.Date: Date of shipment.Ship.Mode: Type of shipment.Customer.ID: Id of customer.Segment : Customer’s segment.Product.ID: Id of product.Category : Product category with 3 levels (“Furniture”, “Office Supplies”, “Technology”)Sub.Category: more specific product categoryProduct.Name: Name of product that was sold.Sales: How much earning from each sale.Quantity: Quantity of item sold.Discount: How much discount was given for each sale.Profit: How much can a company earn from each sale.library(dplyr)## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)## Warning: package 'ggplot2' was built under R version 4.0.5
library(tidyverse)## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.5 v purrr 0.3.4
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'purrr' was built under R version 4.0.5
## Warning: package 'stringr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
retail <- read.csv("retail.csv")
head(retail)explicit coercion
str(retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Change the types of data into the appropriate data types
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)
str(retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Check Missing Value
anyNA(retail)## [1] FALSE
No missing values found
Drop the first column because we don’t need that information
retail <- retail[,-1]
head(retail)summary(retail)## Order.ID Order.Date Ship.Date
## Length:9994 Min. :2014-01-03 Min. :2014-01-07
## Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :2017-12-30 Max. :2018-01-05
##
## Ship.Mode Customer.ID Segment Product.ID
## First Class :1538 Length:9994 Consumer :5191 Length:9994
## Same Day : 543 Class :character Corporate :3020 Class :character
## Second Class :1945 Mode :character Home Office:1783 Mode :character
## Standard Class:5968
##
##
##
## Category Sub.Category Product.Name
## Furniture :2121 Binders :1523 Length:9994
## Office Supplies:6026 Paper :1370 Class :character
## Technology :1847 Furnishings: 957 Mode :character
## Phones : 889
## Storage : 846
## Art : 796
## (Other) :3613
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
##
Data Summary
Which customer_id had the biggest sales?
filter(retail, Sales==max(Sales))Answer : The biggest sales were made by customers with id SM-20320
What were the sub-categories in Office Supplies and how much was the total profit for each
retail %>% filter(Category=='Office Supplies') %>%
group_by(Category,Sub.Category) %>% summarise(total_profit=sum(Profit)) %>%
arrange(desc(total_profit))## `summarise()` has grouped output by 'Category'. You can override using the `.groups` argument.
Answer : In office supplies there were nine items available, namely Appliances, Art, Binders, Envelopes, Fasteners, Labels, Paper, Storage, Supplies. From all that items in Office Supplies, Paper had the biggest profit and Supplies was the lowest one.
How many orders resulted in a loss?
nrow(filter(retail, Profit<0))## [1] 1871
Answer : There are 1871 orders that resulted in a loss
From the customer_id JE-16165, KH-16510 and AD-10180. Which one had the biggest total sales
retail %>% filter(Customer.ID %in% c('JE-16165', 'KH-16510', 'AD-10180')) %>%
group_by(Customer.ID) %>%
summarise(total_sales=sum(Sales)) %>%
arrange(desc(total_sales))Answer : From that three Customer id, AD-10180 had the biggest total sales which is 6106.880 dollars. Then, the lowest one is JE-16165.
Make a new Data Frame named yearly_sales that contains total sales, number of orders, and total profit for each year
yearly_sales <- retail %>% mutate(year=year(Order.Date)) %>%
group_by(year) %>%
summarise(total_sales=sum(Sales),
total_profit=sum(Profit),
total_orders=n()) %>%
arrange(desc(total_sales))
yearly_salesThe yearly_sales dataset shows that the highest total_sales, total_profits and total_orders occurred in 2017, while the lowest total_sales occurred in 2015 but the lowest total_profits and total_orders occurred in 2014.