This is a Learning By Building project for the P4DS material. The data used in this analysis is a sales report from a tech shop covering three months (January to March). The data contains information about Order ID, Product, Quantity Ordered, Price, Order Date, and Purchase Address. The analysis is separated into five parts.
Let’s find insights from these three months of data.
Before going further, having a clear goal is a must in any analysis. There are several business questions that serve as the main goals we must answer.
First, let’s read the data. The original data is split into three files, one per month; let’s read them and combine them into a single object.
# Read the data
jan <- read.csv("data_input/Sales_January_2019.csv")
feb <- read.csv("data_input/Sales_February_2019.csv")
mar <- read.csv("data_input/Sales_March_2019.csv")
sales <- rbind(jan,feb,mar)
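As an aside, if more monthly files were added later, the same combined object could be built programmatically. This is a minimal sketch, assuming the files keep the Sales_<Month>_2019.csv naming inside data_input/:
# Read every monthly sales CSV in data_input/ and row-bind them into one data frame
files <- list.files("data_input", pattern = "^Sales_.*_2019\\.csv$", full.names = TRUE)
sales <- do.call(rbind, lapply(files, read.csv))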
sales
Let’s observe the data types and convert any columns whose types do not match their contents.
str(sales)
#> 'data.frame': 36985 obs. of 6 variables:
#> $ Order.ID : chr "141234" "141235" "141236" "141237" ...
#> $ Product : chr "iPhone" "Lightning Charging Cable" "Wired Headphones" "27in FHD Monitor" ...
#> $ Quantity.Ordered: chr "1" "1" "2" "1" ...
#> $ Price.Each : chr "700" "14.95" "11.99" "149.99" ...
#> $ Order.Date : chr "01/22/19 21:25" "01/28/19 14:15" "01/17/19 13:33" "01/05/19 20:33" ...
#> $ Purchase.Address: chr "944 Walnut St, Boston, MA 02215" "185 Maple St, Portland, OR 97035" "538 Adams St, San Francisco, CA 94016" "738 10th St, Los Angeles, CA 90001" ...
There are columns with mismatched types:
- Quantity.Ordered and Price.Each should be integer/numeric
- Product should be a category/factor
- Order.Date should be in a date-time format
Let’s convert them and assign the result to sales_clean.
# Make a copy of the sales data frame
sales_clean <- sales
# Convert the data types
sales_clean$Quantity.Ordered <- as.integer(sales_clean$Quantity.Ordered)
sales_clean$Price.Each <- as.numeric(sales_clean$Price.Each)
sales_clean$Product <- as.factor(sales_clean$Product)
sales_clean$Order.Date <- as.POSIXct(sales_clean$Order.Date, format = "%m/%d/%y %H:%M")
Check the data types after conversion.
str(sales_clean)
#> 'data.frame': 36985 obs. of 6 variables:
#> $ Order.ID : chr "141234" "141235" "141236" "141237" ...
#> $ Product : Factor w/ 21 levels "","20in Monitor",..: 12 15 21 4 21 7 3 19 9 8 ...
#> $ Quantity.Ordered: int 1 1 2 1 1 1 1 1 1 1 ...
#> $ Price.Each : num 700 14.9 12 150 12 ...
#> $ Order.Date : POSIXct, format: "2019-01-22 21:25:00" "2019-01-28 14:15:00" ...
#> $ Purchase.Address: chr "944 Walnut St, Boston, MA 02215" "185 Maple St, Portland, OR 97035" "538 Adams St, San Francisco, CA 94016" "738 10th St, Los Angeles, CA 90001" ...
Let’s observe our data and check whether there are any missing values.
colSums(is.na(sales_clean))
#> Order.ID Product Quantity.Ordered Price.Each
#> 0 0 164 164
#> Order.Date Purchase.Address
#> 164 0
There are 164 missing values in each of the Quantity.Ordered, Price.Each, and Order.Date columns. These appeared during the type conversion, when entries that could not be parsed were coerced to NA. Let’s take a deeper look.
# Inspect the rows that contain at least one missing value
sales_clean[!complete.cases(sales_clean), ]
The rows with missing values are empty records that carry no information, so it is safe to drop them.
sales_clean <- na.omit(sales_clean)
anyNA(sales_clean)
#> [1] FALSE
Now the data has no rows with missing values.
Now that we have clean data, we can process it to extract insights. We can use summary() to get a quick overview of our data.
summary(sales_clean)
#> Order.ID Product Quantity.Ordered
#> Length:36821 USB-C Charging Cable : 4455 Min. :1.000
#> Class :character Lightning Charging Cable: 4216 1st Qu.:1.000
#> Mode :character AAA Batteries (4-pack) : 4051 Median :1.000
#> AA Batteries (4-pack) : 3990 Mean :1.123
#> Wired Headphones : 3702 3rd Qu.:1.000
#> Apple Airpods Headphones: 3132 Max. :7.000
#> (Other) :13275
#> Price.Each Order.Date Purchase.Address
#> Min. : 2.99 Min. :2019-01-01 03:07:00 Length:36821
#> 1st Qu.: 11.95 1st Qu.:2019-01-30 12:38:00 Class :character
#> Median : 14.95 Median :2019-02-21 12:45:00 Mode :character
#> Mean : 184.37 Mean :2019-02-19 13:28:53
#> 3rd Qu.: 150.00 3rd Qu.:2019-03-13 09:14:00
#> Max. :1700.00 Max. :2019-04-01 03:21:00
#>
From this summary, we can see that the cheapest product costs $2.99 and the most expensive costs $1,700.
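We can confirm these bounds directly (the values match the summary output above):
# Minimum and maximum unit price
range(sales_clean$Price.Each)
#> [1]    2.99 1700.00
To identify which products sit at those price points, let’s build an aggregation table of the average price per product.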
# Aggregation table for average price per product
product_price <- aggregate(formula = Price.Each ~ Product, data = sales_clean, FUN = mean)
# Order the table by price, descending
product_price_ordered <- order(product_price$Price.Each, decreasing = TRUE)
product_price[product_price_ordered, ]
From this aggregate table, we know that the most expensive product is Macbook Pro Laptop and the cheapest is AAA Batteries (4-pack).
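If we only need the two extremes, we can pull just the top and bottom rows of the sorted table; a quick sketch using the objects defined above:
# Most expensive product (first row after the descending sort)
head(product_price[product_price_ordered, ], 1)
# Cheapest product (last row after the descending sort)
tail(product_price[product_price_ordered, ], 1)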
To answer this question, we need an extra column containing the total sales of each transaction. We can create this total_sales column by multiplying Price.Each by Quantity.Ordered.
sales_clean$total_sales <- sales_clean$Price.Each * sales_clean$Quantity.Ordered
head(sales_clean)
Now we have the total sales for each transaction. Let’s calculate how much the shop made over these three months.
sum(sales_clean$total_sales)
#> [1] 6827880
If we assume the currency used in the data is USD, then over three months the tech shop recorded cumulative sales of $6,827,880.
To answer this question, we have to create extra columns containing the day and the month of each transaction. Luckily, we have a timestamp for every transaction, so let’s extract this information from the Order.Date column. To help with this, we will use the lubridate package.
library(lubridate)
sales_clean$monthname <- month(sales_clean$Order.Date, label = TRUE, abbr = F)
sales_clean$dayname <- wday(sales_clean$Order.Date, label = T)
head(sales_clean)
Now we can build an aggregation table of total sales per month or per day.
monthly_sales <-
aggregate(formula = total_sales~monthname, data = sales_clean, FUN = sum)
monthly_sales
To make the comparison easier, let’s use a simple visualization of the sales in each month.
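As a minimal sketch, a base R barplot of the monthly_sales table built above can serve as that visualization:
# Bar chart comparing total sales in each month
barplot(height = monthly_sales$total_sales,
        names.arg = as.character(monthly_sales$monthname),
        main = "Total Sales per Month",
        ylab = "Total Sales (USD)")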
From the table and graph we know that the highest total sales occurred in March. Let’s check the day-to-day transactions: are there any days with more transactions than others?
dailysales <- data.frame(xtabs(formula = ~dayname, data = sales_clean))
dailysales
# Order dailysales by transaction count (Freq), descending
daily_ordered <- order(dailysales$Freq, decreasing = T)
dailysales <- dailysales[daily_ordered, ]
dailysales
From this ordered table we can see which days of the week record the most transactions.
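As with the monthly totals, a simple bar chart makes the comparison easier to read; a minimal sketch using the dailysales table built above:
# Bar chart of transaction counts per day of the week
barplot(height = dailysales$Freq,
        names.arg = as.character(dailysales$dayname),
        main = "Transactions per Day of Week",
        ylab = "Number of Transactions")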
First, we have to extract the region information from the Purchase.Address column and assign it to a new column, region.
library(stringr)
sales_clean$region <- word(sales_clean$Purchase.Address, 2, sep = fixed(","))
sales_clean$region <- trimws(sales_clean$region, which = c("left"))
sales_clean$region <- as.factor(sales_clean$region)
head(sales_clean)
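To see why trimws() was needed above, note what word() returns for one of the addresses from the str() output: the second comma-separated field keeps its leading space.
# The second field after splitting on "," still has a leading space
word("944 Walnut St, Boston, MA 02215", 2, sep = fixed(","))
#> [1] " Boston"
Now we can make an aggregation table based on region.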
region_transaction <- data.frame(xtabs(formula = ~region, data = sales_clean))
region_transaction_ordered <- order(region_transaction$Freq, decreasing=TRUE)
region_transaction <- region_transaction[region_transaction_ordered,]
region_transaction
From the data we know that transactions with San Francisco addresses are the most frequent: more than 8,000 transactions were recorded there in three months.
We started our analysis with questions; now let’s answer them:
- The most expensive product is Macbook Pro Laptop, and the cheapest is AAA Batteries (4-pack).
- Assuming the currency is USD, the shop recorded cumulative sales of $6,827,880 over the three months.
- The highest monthly total sales occurred in March.
- San Francisco is the region with the most transactions, with more than 8,000 recorded in three months.
That sums up our analysis report. Further research with more complete data and more rigorous analysis methods is certainly possible.
The dataset used in this analysis was provided by Kilian Tscherny and can be downloaded from Kaggle.