After spending quite a while with R, I'm convinced it is the most enjoyable tool I have found yet for manipulating large amounts of data. Yes, I can agree that the level of information hiding is annoying sometimes compared with other languages out there, but I think the results justify the means most of the time: you can crunch, store, visualize and analyze unbelievable amounts of data very quickly. My favorite integration so far is MySQL with RStudio, which I'll talk more about below.
First question: "Why would I ever use RStudio to write MySQL? I have a separate IDE for that and it works just fine." Granted. However, what if you had a few problems like these:
What does my sales data from the last few years look like by city, neighborhood and zip code?
What are the important predictors of whether someone will buy my product?
Can you build a model to predict whether someone 18-35 will use my product so I can market to them efficiently?
Most often I encounter three different solutions to these three problems. For question one I might use a third-party visualization system that reads data from the server into dashboards. For question two I need statistical power, so I default to Excel to run models, and that has its limits. For question three I need a more advanced statistical solution that has robust off-the-shelf algorithms and also gives me the ability to write my own if necessary. I've found that with R I don't have to leave the environment and can accomplish all of these, a few of which I'll demonstrate now.
Company XYZ has a database that contains information related to its internet media store. Company XYZ sells music all over the world, in hundreds of languages and in every genre you can think of. However, there is a problem: the company has no idea who buys which genres and types of music, how much they buy, where, and why. Can you help?
Load the necessary packages and connect directly to the database from R. We do that like so:
# Load the necessary packages
library(DBI)
library(RSQLite)
library(leaflet)
library(ggplot2)
library(dplyr)
# Connect to the sqlite file
con <- dbConnect(RSQLite::SQLite(), dbname = "~/Desktop/R Project/SQL Lite/funWithMYSQL/R-code-for-SQL-Integration-/Chinook_Sqlite.sqlite")
For easy reproducibility and security I've chosen to use a file-based database. The connection logic, though, is exactly the same. If you'd like to run this database file yourself you can access it via my GitHub: MyGithub
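For reference, pointing the same code at a live MySQL server only changes the driver and connection details. A minimal sketch, assuming the RMariaDB package and placeholder host and credentials (none of these values are real):

# Hypothetical connection to a MySQL/MariaDB server -- dbname, host, user
# and password below are placeholders, not real credentials
library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
                 dbname = "chinook",
                 host = "db.example.com",
                 user = "analyst",
                 password = "secret")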
Now that we're in R, we can write clever things like the snippet below, which gets the names of all the tables, returns the top 5 rows of every table for context (mainly useful when writing more advanced JOINs), names each element after its table, and stores everything in a list of data frames for easy retrieval.
# get a list of all tables and view the tables
alltables = dbListTables(con)
# Turn all tables into Data Frames
A <- list()
for (i in seq_along(alltables)) {
  A[[i]] <- dbGetQuery(con, paste0("SELECT * FROM ", alltables[i], " LIMIT 5"))
}
# Name A properly. Use A as a way to preview top line items from large DBs.
names(A) <- alltables
Why is this so useful? Well, let's say you're forgetful like me and want a quick reminder of the layout of part (or all) of the database. You can just write:
A$Artist
## ArtistId Name
## 1 1 AC/DC
## 2 2 Accept
## 3 3 Aerosmith
## 4 4 Alanis Morissette
## 5 5 Alice In Chains
Or maybe you'd like a quick reminder of which tables exist and what they're called:
alltables
## [1] "Album" "Artist" "Click" "Clicks"
## [5] "Customer" "Employee" "Genre" "Invoice"
## [9] "InvoiceLine" "MediaType" "Playlist" "PlaylistTrack"
## [13] "Track" "workingMem"
This is all very interesting, but what we're really here to do is show how storing, viewing, and manipulating the data returned from SQL is great in RStudio. Let's assume that Company XYZ's sales department comes in and wants to run a campaign around classical music (who doesn't like Franz Liszt?). They ask which cities in the database are the largest consumers of classical tracks. So you do the following:
# Which cities in the database are top consumers of classical music?
Classical_Lovers <- dbGetQuery(con, "SELECT Invoice.BillingCity, SUM(InvoiceLine.Quantity) AS Count
FROM Invoice
JOIN InvoiceLine ON (Invoice.InvoiceId = InvoiceLine.InvoiceId)
JOIN Track ON (InvoiceLine.TrackId = Track.TrackId)
JOIN Genre ON (Track.GenreId = Genre.GenreId)
WHERE Genre.Name = 'Classical'
GROUP BY Invoice.BillingCity
ORDER BY Count DESC
LIMIT 50")
The syntax of the query is left unchanged: it's written directly in R and the result is stored in a data frame. Using that data we then spin out a quick plot of the number of classical tracks sold in each billing city.
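The plot itself isn't reproduced here, but with ggplot2 already loaded, a quick version could look something like this (the column names come from the query above; the geom and ordering are just one reasonable choice):

# Quick bar chart of classical track sales by billing city (rough sketch)
ggplot(Classical_Lovers, aes(x = reorder(BillingCity, Count), y = Count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Billing city", y = "Classical tracks sold")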
Next, let's say you're asked whether, all else equal, longer songs tend to be more popular than shorter ones. You decide to do a quick investigation and see whether that might be true by testing your own data:
Longer_Better <- dbGetQuery(con, "SELECT COUNT(InvoiceLine.TrackId) AS Sales, Track.Milliseconds
FROM InvoiceLine
JOIN Track ON (InvoiceLine.TrackId = Track.TrackId)
GROUP BY InvoiceLine.TrackId")
You realize that, given the limited amount of time you've been collecting data, sales fall into only two levels, roughly "buy more" or "buy less". Given the small set, binary outcome, and question of likelihood, you fit a logistic regression.
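The exact call isn't shown here, but judging from the Call line in the summary below it was presumably something like this (the object name fit is my own choice):

# Logistic regression of the two-level sales outcome on track length
fit <- glm(as.factor(Sales) ~ Milliseconds, family = "binomial", data = Longer_Better)
summary(fit)

Summarizing the fit gives the following: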
##
## Call:
## glm(formula = as.factor(Sales) ~ Milliseconds, family = "binomial",
## data = Longer_Better)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5576 -0.5379 -0.5328 -0.5214 2.3801
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.780e+00 8.864e-02 -20.084 <2e-16 ***
## Milliseconds -3.689e-07 1.794e-07 -2.056 0.0398 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1525.9 on 1983 degrees of freedom
## Residual deviance: 1520.5 on 1982 degrees of freedom
## AIC: 1524.5
##
## Number of Fisher Scoring iterations: 5
This analysis isn't trying to be robust by any means, but it does suggest a small, statistically significant inverse relationship between the length of a track and how often customers buy it. Maybe our attention spans are getting shorter…
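To put that coefficient on a more intuitive scale, you can exponentiate it over a one-minute change in length (reusing the fit object assumed above):

# Odds ratio for a 60,000 ms (one minute) increase in track length.
# With the estimate above (-3.689e-07) this is roughly exp(-0.022), or about 0.98,
# i.e. around a 2% drop in the odds of the higher-sales level per extra minute.
exp(coef(fit)["Milliseconds"] * 60000)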
Lastly, you're asked to present Company XYZ with a map of where its US customers purchase from, by city, pulled directly from the database. The map should be fully interactive, fast to create, and easy to use. You build it using the same logic as before: query, store, clean, and visualize.
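The original interactive map isn't embedded here, but with leaflet already loaded, a sketch could look like the following. The database stores city names rather than coordinates, so the city_coords lookup (a data frame with columns City, lat and lng) is a hypothetical helper you would have to build yourself, for example from a geocoding step:

# Pull US purchase counts by billing city
US_Sales <- dbGetQuery(con, "SELECT Invoice.BillingCity AS City,
                                    COUNT(InvoiceLine.InvoiceLineId) AS Purchases
                             FROM Invoice
                             JOIN InvoiceLine ON (Invoice.InvoiceId = InvoiceLine.InvoiceId)
                             WHERE Invoice.BillingCountry = 'USA'
                             GROUP BY Invoice.BillingCity")

# city_coords is assumed, NOT in the database: a data frame of City, lat, lng
map_data <- inner_join(US_Sales, city_coords, by = "City")

# Interactive map: one circle per city, sized by purchase volume
leaflet(map_data) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lng, lat = ~lat,
                   radius = ~sqrt(Purchases),
                   label = ~paste0(City, ": ", Purchases, " purchases"))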
This is just a small sample of what can be accomplished using RStudio and SQL. You can also update tables, create tables, and run any other statements your flavor of SQL allows, directly from RStudio.
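For example, writes go through the same connection; here the table name is purely illustrative, and the final call closes the connection when you're done:

# Write an R data frame back to the database as a new table
# ('ClassicalByCity' is an illustrative name, not part of the Chinook schema)
dbWriteTable(con, "ClassicalByCity", Classical_Lovers, overwrite = TRUE)

# Run arbitrary SQL statements (UPDATE, DROP, CREATE, ...) directly
dbExecute(con, "DROP TABLE ClassicalByCity")

# Close the connection when you're finished
dbDisconnect(con)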