This is a Learning By Building project for the P4DS material. The data used in this analysis is a sales report from a tech shop covering three months (January to March). The data contains information about Order ID, Product, Quantity Ordered, Price, Order Date, and Purchase Address. The analysis is separated into five parts.
Let’s find insights from these three months of data.
Before going further, having a clear goal is a must in any analysis. There are several business questions that serve as the main goals we must answer.
First, let’s read the data. The original data is split into three files, one per month; let’s read them and combine them into a single object.
# Read the data
jan <- read.csv("data_input/Sales_January_2019.csv")
feb <- read.csv("data_input/Sales_February_2019.csv")
mar <- read.csv("data_input/Sales_March_2019.csv")
sales <- rbind(jan,feb,mar)
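As an aside, if more monthly files were added later, the same combined object could be built programmatically. This is a minimal sketch, assuming the files keep the Sales_<Month>_2019.csv naming inside data_input/:
# Read every monthly sales CSV in data_input/ and row-bind them into one data frame
files <- list.files("data_input", pattern = "^Sales_.*_2019\\.csv$", full.names = TRUE)
sales <- do.call(rbind, lapply(files, read.csv))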
sales
Let’s observe the data types and convert any columns whose types do not match their contents.
str(sales)
#> 'data.frame': 36985 obs. of 6 variables:
#> $ Order.ID : chr "141234" "141235" "141236" "141237" ...
#> $ Product : chr "iPhone" "Lightning Charging Cable" "Wired Headphones" "27in FHD Monitor" ...
#> $ Quantity.Ordered: chr "1" "1" "2" "1" ...
#> $ Price.Each : chr "700" "14.95" "11.99" "149.99" ...
#> $ Order.Date : chr "01/22/19 21:25" "01/28/19 14:15" "01/17/19 13:33" "01/05/19 20:33" ...
#> $ Purchase.Address: chr "944 Walnut St, Boston, MA 02215" "185 Maple St, Portland, OR 97035" "538 Adams St, San Francisco, CA 94016" "738 10th St, Los Angeles, CA 90001" ...
There are columns with mismatched types:
- Quantity.Ordered and Price.Each should be integer/numeric
- Product should be a category/factor
- Order.Date should be in a date-time format
Let’s convert them and assign the result to sales_clean.
# Make a copy of the sales data frame
sales_clean <- sales
# Convert the data types
sales_clean$Quantity.Ordered <- as.integer(sales_clean$Quantity.Ordered)
sales_clean$Price.Each <- as.numeric(sales_clean$Price.Each)
sales_clean$Product <- as.factor(sales_clean$Product)
sales_clean$Order.Date <- as.POSIXct(sales_clean$Order.Date, format = "%m/%d/%y %H:%M")
Check the data types after conversion.
str(sales_clean)
#> 'data.frame': 36985 obs. of 6 variables:
#> $ Order.ID : chr "141234" "141235" "141236" "141237" ...
#> $ Product : Factor w/ 21 levels "","20in Monitor",..: 12 15 21 4 21 7 3 19 9 8 ...
#> $ Quantity.Ordered: int 1 1 2 1 1 1 1 1 1 1 ...
#> $ Price.Each : num 700 14.9 12 150 12 ...
#> $ Order.Date : POSIXct, format: "2019-01-22 21:25:00" "2019-01-28 14:15:00" ...
#> $ Purchase.Address: chr "944 Walnut St, Boston, MA 02215" "185 Maple St, Portland, OR 97035" "538 Adams St, San Francisco, CA 94016" "738 10th St, Los Angeles, CA 90001" ...
Let’s observe our data and check whether there are any missing values.
colSums(is.na(sales_clean))
#> Order.ID Product Quantity.Ordered Price.Each
#> 0 0 164 164
#> Order.Date Purchase.Address
#> 164 0
There are 164 missing values in each of the Quantity.Ordered, Price.Each, and Order.Date columns. These appeared during the type conversion, when entries that could not be parsed were coerced to NA. Let’s take a deeper look.
# Inspect the rows that contain at least one missing value
sales_clean[!complete.cases(sales_clean), ]
The rows with missing values are empty records that carry no information, so it is safe to drop them.
sales_clean <- na.omit(sales_clean)
anyNA(sales_clean)
#> [1] FALSE
Now the data has no rows with missing values.
Now that we have clean data, we can process it to extract insights. We can use summary() to get a quick overview of our data.
summary(sales_clean)
#> Order.ID Product Quantity.Ordered
#> Length:36821 USB-C Charging Cable : 4455 Min. :1.000
#> Class :character Lightning Charging Cable: 4216 1st Qu.:1.000
#> Mode :character AAA Batteries (4-pack) : 4051 Median :1.000
#> AA Batteries (4-pack) : 3990 Mean :1.123
#> Wired Headphones : 3702 3rd Qu.:1.000
#> Apple Airpods Headphones: 3132 Max. :7.000
#> (Other) :13275
#> Price.Each Order.Date Purchase.Address
#> Min. : 2.99 Min. :2019-01-01 03:07:00 Length:36821
#> 1st Qu.: 11.95 1st Qu.:2019-01-30 12:38:00 Class :character
#> Median : 14.95 Median :2019-02-21 12:45:00 Mode :character
#> Mean : 184.37 Mean :2019-02-19 13:28:53
#> 3rd Qu.: 150.00 3rd Qu.:2019-03-13 09:14:00
#> Max. :1700.00 Max. :2019-04-01 03:21:00
#>
From this summary, we can see that the cheapest product costs $2.99 and the most expensive costs $1,700.
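We can confirm these bounds directly (the values match the summary output above):
# Minimum and maximum unit price
range(sales_clean$Price.Each)
#> [1]    2.99 1700.00
To identify which products sit at those price points, let’s build an aggregation table of the average price per product.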
# Aggregation table for average price per product
product_price <- aggregate(formula = Price.Each ~ Product, data = sales_clean, FUN = mean)
# Order the table by price, descending
product_price_ordered <- order(product_price$Price.Each, decreasing = TRUE)
product_price[product_price_ordered, ]
From this aggregate table, we know that the most expensive product is Macbook Pro Laptop and the cheapest is AAA Batteries (4-pack).
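If we only need the two extremes, we can pull just the top and bottom rows of the sorted table; a quick sketch using the objects defined above:
# Most expensive product (first row after the descending sort)
head(product_price[product_price_ordered, ], 1)
# Cheapest product (last row after the descending sort)
tail(product_price[product_price_ordered, ], 1)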
To answer this question, we need an extra column containing the total sales of each transaction. We can create this total_sales column by multiplying Price.Each by Quantity.Ordered.
sales_clean$total_sales <- sales_clean$Price.Each * sales_clean$Quantity.Ordered
head(sales_clean)
Now we have the total sales for each transaction. Let’s calculate how much the shop made over these three months.
sum(sales_clean$total_sales)
#> [1] 6827880
If we assume the currency used in the data is USD, then over three months the tech shop recorded cumulative sales of $6,827,880.
To answer this question, we have to create extra columns containing the day and the month of each transaction. Luckily, we have a timestamp for every transaction, so let’s extract this information from the Order.Date column. To help with this, we will use the lubridate package.
library(lubridate)
sales_clean$monthname <- month(sales_clean$Order.Date, label = TRUE, abbr = F)
sales_clean$dayname <- wday(sales_clean$Order.Date, label = T)
head(sales_clean)
Now we can build an aggregation table of total sales per month or per day.
monthly_sales <-
aggregate(formula = total_sales~monthname, data = sales_clean, FUN = sum)
monthly_sales
To make the comparison easier, let’s use a simple visualization of the sales in each month.
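As a minimal sketch, a base R barplot of the monthly_sales table built above can serve as that visualization:
# Bar chart comparing total sales in each month
barplot(height = monthly_sales$total_sales,
        names.arg = as.character(monthly_sales$monthname),
        main = "Total Sales per Month",
        ylab = "Total Sales (USD)")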
From the table and graph we know that the highest total sales occurred in March. Let’s check the day-to-day transactions: are there any days with more transactions than others?
dailysales <- data.frame(xtabs(formula = ~dayname, data = sales_clean))
dailysales
# Order dailysales by transaction count (Freq), descending
daily_ordered <- order(dailysales$Freq, decreasing = T)
dailysales <- dailysales[daily_ordered, ]
dailysales
From this ordered table we can see which days of the week record the most transactions.
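As with the monthly totals, a simple bar chart makes the comparison easier to read; a minimal sketch using the dailysales table built above:
# Bar chart of transaction counts per day of the week
barplot(height = dailysales$Freq,
        names.arg = as.character(dailysales$dayname),
        main = "Transactions per Day of Week",
        ylab = "Number of Transactions")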
First, we have to extract the region information from the Purchase.Address column and assign it to a new column, region.
library(stringr)
sales_clean$region <- word(sales_clean$Purchase.Address, 2, sep = fixed(","))
sales_clean$region <- trimws(sales_clean$region, which = c("left"))
sales_clean$region <- as.factor(sales_clean$region)
head(sales_clean)
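To see why trimws() was needed above, note what word() returns for one of the addresses from the str() output: the second comma-separated field keeps its leading space.
# The second field after splitting on "," still has a leading space
word("944 Walnut St, Boston, MA 02215", 2, sep = fixed(","))
#> [1] " Boston"
Now we can make an aggregation table based on region.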
region_transaction <- data.frame(xtabs(formula = ~region, data = sales_clean))
region_transaction_ordered <- order(region_transaction$Freq, decreasing=TRUE)
region_transaction <- region_transaction[region_transaction_ordered,]
region_transaction
From the data we know that transactions with San Francisco addresses are the most frequent: more than 8,000 transactions were recorded there in three months.
We started our analysis with questions; now let’s answer them:
- The most expensive product is Macbook Pro Laptop, and the cheapest is AAA Batteries (4-pack).
- Assuming the currency is USD, the shop recorded cumulative sales of $6,827,880 over the three months.
- The highest monthly total sales occurred in March.
- San Francisco is the region with the most transactions, with more than 8,000 recorded in three months.
That sums up our analysis report. Further research with more complete data and more rigorous analysis methods is certainly possible.
The dataset used in this analysis was provided by Kilian Tscherny and can be downloaded from Kaggle.