0.0.1 Load in required packages

library(odbc)
library(DBI)
library(RSQLite)
library(tidyverse)

options(scipen = 999)

0.0.2 Read in data

sales<-readr::read_csv("sales.csv")

0.0.3 What variables do we have?

names(sales)
## [1] "Order ID"         "Product"          "Quantity Ordered" "Price Each"      
## [5] "Order Date"       "Purchase Address"

0.0.4 Objectives

  • Which city sold the most and made the most profit?
  • Which products had the largest sales?
  • At what time should we display advertisements?
  • On which days are the most sales made and the most quantities sold?
  • In which month are the most sales made?

0.0.5 Look at the data types

sapply(sales[1,],class)
##         Order ID          Product Quantity Ordered       Price Each 
##      "character"      "character"      "character"      "character" 
##       Order Date Purchase Address 
##      "character"      "character"

0.0.6 Look at missing data

colSums(is.na(sales))
##         Order ID          Product Quantity Ordered       Price Each 
##              545              545              545              545 
##       Order Date Purchase Address 
##              545              545

0.0.7 What's wrong with the data?

Let's try to find out where the missing values are coming from; we will check the row numbers of the rows with missing values.

sales |>
  mutate(id=row_number()) |> 
  relocate(id) |> 
  filter(is.na(`Price Each`))

Now that we have seen that the missing values occur on the same rows across all columns, we can remove those rows!

sales<-sales |> 
  filter(!is.na(`Order ID`))
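
As a quick sanity check, we can rerun the missing-value count; every column should now report zero.

# confirm that the rows with missing values are gone
colSums(is.na(sales))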

0.0.8 Glimpse at the data

glimpse(sales)
## Rows: 186,305
## Columns: 6
## $ `Order ID`         <chr> "295665", "295666", "295667", "295668", "295669", "~
## $ Product            <chr> "Macbook Pro Laptop", "LG Washing Machine", "USB-C ~
## $ `Quantity Ordered` <chr> "1", "1", "1", "1", "1", "1", "1", "2", "1", "4", "~
## $ `Price Each`       <chr> "1700", "600.0", "11.95", "149.99", "11.95", "3.84"~
## $ `Order Date`       <chr> "12/30/19 00:01", "12/29/19 07:03", "12/12/19 18:21~
## $ `Purchase Address` <chr> "136 Church St, New York City, NY 10001", "562 2nd ~

Why are Price Each and Quantity Ordered stored as characters? From the output of glimpse(), what can we see?

sales |> 
  mutate(id=row_number()) |>  
  relocate(id) |> 
  filter(grepl("Price",`Price Each`,perl = T))

It looks like the header row was accidentally repeated throughout the data: in these rows each observation contains its own column name. Let us drop these rows and rename some columns.

sales<-sales |> 
  filter(!grepl("Price",`Price Each`,perl = T)) |> # drop the repeated header rows
  rename(quantity=`Quantity Ordered`) |> 
  rename(address = `Purchase Address`) |> 
  rename(Price = `Price Each`)
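
As a quick check that the repeated header rows are really gone (a minimal sketch; Price is the renamed Price Each column):

# should return 0: no observation still contains the column name
sum(grepl("Price",sales$Price))
# inspect the renamed columns
names(sales)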

0.0.9 Do we have any duplicates?

dim(sales)
## [1] 185950      6

we now have 185,950 rows in our data

sales |> 
  summarise(unique_rows=n_distinct(`Order ID`))

but we have only 178,438 unique order IDs … this is because one order can contain several different products
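
To verify that repeated order IDs are multi-product orders rather than exact duplicates, here is a minimal sketch using dplyr's add_count():

# look at a few orders that span more than one row
sales |> 
  add_count(`Order ID`,name="n_rows") |> 
  filter(n_rows>1) |> 
  arrange(`Order ID`) |> 
  select(`Order ID`,Product,quantity) |> 
  head(n=10)

Next, how many items do people order at a time?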

sales |> 
  janitor::tabyl(quantity) |> 
  as_tibble() |> 
  mutate(percent=round(percent,4))

  • Well, people tend to order just what's enough for them, so it's not surprising that most orders are for a single item.

Let's look at the classes again:

sapply(sales[1,],class)
##    Order ID     Product    quantity       Price  Order Date     address 
## "character" "character" "character" "character" "character" "character"

0.0.10 Extracting more information from the data

Let's begin by using R to achieve this, and then later on use SQL for the same questions!

0.0.11 Let's look at the address column

sales |> 
  relocate(address)

One should note that this column contains useful information such as the street, city and ZIP code. We need to extract the city and create a column for it!

  • We can use the function separate() to achieve this!

(sales_df<-sales |> 
  separate(address,into=c("address1","city","code"),",") |> # split address on commas
  mutate_at(c(3,4),as.numeric) |>      # columns 3 and 4: quantity and Price
  mutate_if(is.character,as.factor) |> 
  relocate(c("address1","city","code")))

Nice, now we have the variable city standing on its own!

  • Now I can store this as a database, for comparison's sake.

0.1 Storing as a database

sales_df<-sales_df |> 
  mutate(Date=mdy_hm(`Order Date`),
         Date1=as.character(Date))

sold<-dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(sold,sales_df)
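
To confirm that the copy worked, we can already send SQL through DBI; copy_to() registers the data under the table name sales_df by default.

# quick check: count the rows in the in-memory table
dbGetQuery(sold,"SELECT COUNT(*) AS n FROM sales_df")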

0.2 Cities recorded

0.2.1 Cities recorded: R

sales_df |> 
  summarise(n_unique=n_distinct(city))

0.2.2 Cities recorded: SQL

SELECT COUNT(DISTINCT city) AS n_unique 
FROM sales_df;
1 record
n_unique
9
SELECT DISTINCT city 
FROM sales_df;
9 records
city
New York City
San Francisco
Atlanta
Portland
Dallas
Los Angeles
Boston
Austin
Seattle

0.3 Quantity sold per city

0.3.1 Quantity sold per city: R

sales_df |> 
  group_by(city) |> 
  dplyr::summarise(sum_quantity=sum(quantity,na.rm=T)) |> 
  arrange(desc(sum_quantity))

San Francisco sold the largest quantity of items compared to the other cities.
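
A bar chart makes this ranking easier to see; here is a minimal sketch with ggplot2 (loaded via the tidyverse):

# quantity sold per city, largest at the top
sales_df |> 
  group_by(city) |> 
  dplyr::summarise(sum_quantity=sum(quantity,na.rm=T)) |> 
  ggplot(aes(x=reorder(city,sum_quantity),y=sum_quantity))+
  geom_col()+
  coord_flip()+
  labs(x="city",y="quantity sold")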

0.3.2 Quantity sold per city: SQL

SELECT city ,SUM(quantity) AS sum_quantity 
FROM sales_df
GROUP BY city
ORDER BY sum_quantity DESC;
9 records
city sum_quantity
San Francisco 50239
Los Angeles 33289
New York City 27932
Boston 22528
Dallas 16730
Atlanta 16602
Seattle 16553
Portland 14053
Austin 11153

0.4 Mean sales per city

0.4.1 Mean sales per city: R

sales_df |> 
  group_by(city) |> 
  dplyr::summarise(mean_sale=mean(Price,na.rm=T)) |> 
  arrange(desc(mean_sale))

Atlanta had the highest average item price compared to the other cities.
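
Here mean(Price) averages line-item prices. If by "mean sale" we instead want the average order value, a sketch (my interpretation, not part of the original analysis) would total each order first and then average per city:

# average order value per city: total each order, then average those totals
sales_df |> 
  group_by(city,`Order ID`) |> 
  dplyr::summarise(order_total=sum(Price*quantity),.groups="drop") |> 
  group_by(city) |> 
  dplyr::summarise(mean_order_value=mean(order_total)) |> 
  arrange(desc(mean_order_value))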

0.4.2 Mean sales per city: SQL

SELECT city ,AVG(Price) AS mean_price 
FROM sales_df
GROUP BY city
ORDER BY mean_price DESC;
9 records
city mean_price
Atlanta 186.8092
New York City 186.3391
Dallas 185.7374
Seattle 185.5346
Portland 185.1382
San Francisco 183.5702
Los Angeles 183.1257
Austin 182.7232
Boston 182.4726

0.5 Top 10 products by quantity sold

0.5.1 What are the top 10 products by quantity sold: R

sales_df |> 
  group_by(Product) |> 
  dplyr::summarise(num_quantity=sum(quantity,na.rm=T)) |> 
  arrange(desc(num_quantity)) |> 
  head(n=10)

AAA Batteries (4-pack) had the largest quantity sold.

0.5.2 What are the top 10 products by quantity sold: SQL

SELECT Product ,SUM(quantity) AS num_quantity
FROM sales_df
GROUP BY Product
ORDER BY num_quantity DESC
LIMIT 10;
Displaying records 1 - 10
Product num_quantity
AAA Batteries (4-pack) 31017
AA Batteries (4-pack) 27635
USB-C Charging Cable 23975
Lightning Charging Cable 23217
Wired Headphones 20557
Apple Airpods Headphones 15661
Bose SoundSport Headphones 13457
27in FHD Monitor 7550
iPhone 6849
27in 4K Gaming Monitor 6244

0.6 Top 10 products by sales

0.6.1 What are the top 10 products by sales: R

sales_df |> 
  group_by(Product) |> 
  dplyr::summarise(sum_price=sum(Price*quantity)) |> 
  arrange(desc(sum_price)) |> 
  head(n=10)

0.6.2 What are the top 10 products by sales: SQL

SELECT Product ,SUM(Price*quantity) AS sum_price
FROM sales_df
GROUP BY Product
ORDER BY sum_price DESC
LIMIT 10;
Displaying records 1 - 10
Product sum_price
Macbook Pro Laptop 8037600
iPhone 4794300
ThinkPad Laptop 4129959
Google Phone 3319200
27in 4K Gaming Monitor 2435098
34in Ultrawide Monitor 2355558
Apple Airpods Headphones 2349150
Flatscreen TV 1445700
Bose SoundSport Headphones 1345565
27in FHD Monitor 1132424

0.7 Aggregating by day and time, focusing only on 2019

  • Let's look at the date column:

sales_df |> 
  relocate(`Order Date`)

The date column contains a wealth of information, so we will extract the year, month, day and hour using the lubridate package.

sales_df<-sales_df |> 
  mutate(Date=mdy_hm(`Order Date`), # parse to date-time
         year=year(Date),           # extract year
         Month=month(Date,label=T), # extract month
         Day=wday(Date,label=T),    # extract weekday
         hour=hour(Date),           # extract hour 
         quarter=quarter(Date),     # extract quarter 
         am_pm=pm(Date),            # TRUE when the time falls in the afternoon/evening (PM)
         Date1=as.character(Date))  
sales_df

0.8 Best month for sales

0.8.1 What month is best for sales: R

sales_df |> 
  filter(year==2019) |> 
  group_by(Month) |> 
  dplyr::summarise(sum_price=sum(quantity*Price)) |> 
  arrange(desc(sum_price)) 

December had the greatest total sales, followed by October.
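
A column chart (a sketch reusing the pipeline above) makes the December peak and the run-up to the holiday season easy to see:

# total sales per month in 2019
sales_df |> 
  filter(year==2019) |> 
  group_by(Month) |> 
  dplyr::summarise(sum_price=sum(quantity*Price)) |> 
  ggplot(aes(x=Month,y=sum_price))+
  geom_col()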

0.8.2 What month is best for sales: SQL

SELECT strftime('%m',Date1) AS month,
       strftime('%Y',Date1) AS year,
       SUM(quantity*Price) AS total_price
FROM sales_df 
WHERE strftime('%Y',Date1) = '2019' -- filter rows to 2019 before grouping
GROUP BY month
ORDER BY total_price DESC;
      
Displaying records 1 - 10
month year total_price
12 2019 4613443
10 2019 3736727
04 2019 3390670
11 2019 3199603
05 2019 3152607
03 2019 2807100
07 2019 2647776
06 2019 2577802
08 2019 2244468
02 2019 2202022

0.9 Best time for advertisements

0.9.1 What time should we display advertisements: R

sales_df |> 
  filter(year==2019) |> 
  group_by(hour) |> 
  dplyr::summarise(sum_price=sum(Price*quantity)) |> 
  arrange(desc(sum_price)) 

We have the 24 hours of the day, where hour 0-23 runs in ascending order from midnight through the early morning to late at night.

sales_df |> 
  filter(year==2019) |> 
  group_by(hour) |> 
  dplyr::summarise(sum_price=sum(Price*quantity)) |> 
  arrange(desc(sum_price)) |> 
  ggplot(aes(x=hour,y=sum_price))+
  geom_point()+
  geom_line()+
  tvthemes::theme_simpsons()

We notice a huge peak between 10 am and 1 pm, and another between 6 pm and 8 pm.

0.9.2 What time should we display advertisements: SQL

SELECT strftime('%Y',Date1) AS year,
       strftime('%H',Date1) AS hour,
       SUM(quantity*Price) AS total_price
FROM sales_df 
WHERE strftime('%Y',Date1) = '2019' -- filter rows to 2019 before grouping
GROUP BY hour
ORDER BY total_price DESC;
      
Displaying records 1 - 10
year hour total_price
2019 19 2412939
2019 12 2316821
2019 11 2300610
2019 20 2281716
2019 18 2219348
2019 13 2155390
2019 17 2129362
2019 14 2083673
2019 21 2042001
2019 10 1944287