Introduction

In this document, we will simulate a typical exercise of pulling data from a sqlite database and manipulating them in R. We will first create 3 tables in a sqlite database and then query them. We will follow the below steps:

  1. Create SQLite database
  2. Create 3 tables
  3. Query data from SQLite database
  4. Manipulate data

To facilitate this exercise, we will use three .csv datasets saved on github. Both of these three datasets are based on fictitious data.

The overall goals of this exercise are:

Getting Started

We will first load a few helper libraries and the three datasets from github: customers, calls, and sales. The customers dataset has a list of all of our customers with their phone numbers and state of birth. The calls dataset has the number of calls by phone number and duration of calls. The sales dataset has our company’s sales to each customer.

library(DBI)
library(RSQLite)
library(tidyverse)

customers = read.csv("https://raw.githubusercontent.com/mugishajean/merge-data/master/Customers.csv")
head(customers)
##        Phone      Fname     Lname State
## 1 5553334444 FirstName1 LastName1    CA
## 2 5553334445 FirstName2 LastName2    NY
## 3 5553334446 FirstName3 LastName3    NJ
## 4 5553334447 FirstName4 LastName4    CA
## 5 5553334448 FirstName5 LastName5    NC
## 6 5553334449 FirstName6 LastName6    FL
calls = read.csv("https://raw.githubusercontent.com/mugishajean/merge-data/master/calls.csv")
head(calls)
##            Phone Calls Minutes
## 1   555-333-4444     3      20
## 2 (555) 333 4445     4      30
## 3 5553334446-xxx     5      40
## 4   (555)3334447     6      50
## 5    555)3334448     7      60
## 6   (555)3334449     8      70
sales = read.csv("https://raw.githubusercontent.com/mugishajean/merge-data/master/sales.csv")
head(sales)
##      Product  Category                  Name Sales
## 1      apple     Fruit LastName1, FirstName1  1000
## 2     banana     Fruit LastName6, FirstName6   400
## 3 strawberry     Fruit LastName4, FirstName4   100
## 4         TV Equipment LastName5, FirstName5 10000
## 5      Radio Equipment LastName6, FirstName6  6000
## 6      Table Furniture LastName7, FirstName7  2100

Notice above the extra characters in calls phone numbers. We will need to transform them in order to merge calls with other datasets.

Create SQLite database and three tables

We will now create a sqlite database called ourcompany and save the three datasets.

createDB <- dbConnect(RSQLite::SQLite(), "ourcompany.sqlite")
dbWriteTable(createDB, "customers", customers)
dbWriteTable(createDB, "calls", calls)
dbWriteTable(createDB, "sales", sales)
dbListTables(createDB)
## [1] "calls"     "customers" "sales"

Query data from sqlite database

Now that we have created the sqlite database, we can move on to querying the tables. We will load all observations in R.

mysales = dbGetQuery(createDB, 'SELECT * FROM sales')
mycustomers = dbGetQuery(createDB, 'SELECT * FROM customers')
mycalls = dbGetQuery(createDB, 'SELECT * FROM calls')

Manipulate data

We will now proceed to transform the data and get the top 3 customers based on total sales.

mysales = mysales %>% 
  group_by(Category, Name) %>%
  summarise(Totalsales = sum(Sales)) %>%
  top_n(n=-1, wt = Name)

mysales
## # A tibble: 3 x 3
## # Groups:   Category [3]
##   Category  Name                  Totalsales
##   <chr>     <chr>                      <int>
## 1 Equipment LastName1, FirstName1      31642
## 2 Fruit     LastName1, FirstName1       1000
## 3 Furniture LastName5, FirstName5       7654

In order to merge with customers, we need to format the names so that they match sales. We will create a new column Name

mycustomers = mycustomers %>%
  mutate(Name = paste(Lname, Fname, sep = ", "))

Next, we will remove the extra characters in calls:

mycalls = mycalls %>%
  mutate(Phone = as.numeric(gsub("[-()x\\[:space:]]", "", Phone)))

head(mycalls)
##        Phone Calls Minutes
## 1 5553334444     3      20
## 2 5553334445     4      30
## 3 5553334446     5      40
## 4 5553334447     6      50
## 5 5553334448     7      60
## 6 5553334449     8      70

Finally, we can merge sales and customers by Name and calls by Phone

mytop = left_join(mysales, mycustomers[, c(1,4,5)], by = "Name") %>%
  left_join(mycalls, by="Phone")

mytop
## # A tibble: 3 x 7
## # Groups:   Category [?]
##   Category  Name                  Totalsales      Phone State Calls Minutes
##   <chr>     <chr>                      <int>      <dbl> <chr> <int>   <int>
## 1 Equipment LastName1, FirstName1      31642 5553334444 CA        3      20
## 2 Fruit     LastName1, FirstName1       1000 5553334444 CA        3      20
## 3 Furniture LastName5, FirstName5       7654 5553334448 NC        7      60

Conclusion

In this document, we sought to demonstrate how to create a simple sqlite database and query data from it. We also show how to clean phone numbers and to get the top 3 customers.