Overview

The sqldf package provides an easy way to execute Series Query Language (SQL) statements in R. According to the sqldf github page, the package is optimized for convenience, implying that it is intended for people with prior knowledge of SQL.

Structured Query Language, or SQL, is a language developed by IBM used to access and manipulate databases. Many of the world’s most prominent and important database systems use SQL in addition to their own proprietary relational database management systems (RDBMS). Before SQL, people would have to use read-write Application Programming Interfaces to access digital records, but with SQL, you can access multiple records with minimal commands and simple syntax. Common applications of SQL commands include executing queries, retrieving data, inserting, updating, and deleting records, and creating databases. SQL is not designed for statistical analysis, it is simply for data management.

Usage

The only function included in the sqldf package is sqldf(). Although the function is optimized for convenience, in some cases the sqldf function “can be faster than the corresponding pure R calculation.” The function takes R dataframe(s) in place of standard SQL table name(s) and automatically creates an appropriate database. sqldf() executes the specified SQL commands upon said database and returns an R dataframe. The database sqldf() created is deleted making the entire process “transparent” to the user.

While there is only one function included in the sqldf package, there are many SQL keywords and functions available to the user through the package.

  • SELECT: the primary function sqldf refers to. Specifies the columns in the database that the user wants to query

  • FROM: specifies which table to select or delete data from

  • WHERE: filters results to include only records that satisfy the given condition

  • DELETE: deletes rows from a table

  • DROP: deletes a column, constraint, database, index, table, or view

  • GROUP BY: groups rows that have the same values into summary rows, frequently used with aggregate functions, such as COUNT(), MAX(), MIN(), SUM(), and AVG().

  • ORDER BY: used to sort the result-set in ascending or descending order, ascending is default so use the DESC keyword to sort in descending order

  • CASE: similar to a series of if, else if, and else statements in R. Case statements go through a set of conditions and execute the relevant code once a condition is met.

    • WHEN: part of a CASE statement, the explicit condition to be met
    • THEN: follows a WHEN statement, the code to execute if the condition is met
    • ELSE: what to do if no condition is met, if there is no ELSE statement and no condition is met, the series returns NULL
    • END: signifies the end of a CASE statement
  • JOIN: there are many different methods to join SQL tables together, here are two common examples (note: sqldf does not support right join, the equivalent is to swap the table order and use LEFT JOIN)

    • INNER JOIN: selects records that have matches in both tables and merges specified together, must specify which column to merge on in each table

    • LEFT JOIN: The result of LEFT JOIN shall be the same as the result of INNER JOIN + we’ll have rows, from the “left” table, without a pair in the “right” table

    • CROSS JOIN: this command acts as almost as a Cartesian join function, generating a paired combination of each row for specified columns from two tables.

A visual example of CROSS JOIN

Package Dependencies

  • Gsubfn: The gsubfn function is like gsub but can take a replacement function or certain other objects instead of the replacement string. Gsubfn can also be used to split strings on content rather than requiring a delimiter. The function has the ability to translate formulas to functions and allows formulas in function calls instead of functions. This is especially useful in SQL calls that involve strings where substitution is desirable.
  • Proto: An object oriented system using object-based, or prototype-based, object oriented ideas. The one function, ‘proto’ creates or modifies objects of the proto system.
  • RSQLite: RSQLite package will integrate SQLite into R running in RStudio. SQLite is a public-domain, single-user, very light-weight database engine that implements a subset of the SQL standard, including the core table creation, updating, insertion, and selection operations, plus transaction management.

Some Applications of SQLDF

Here are some examples of how one uses SQLDF(), in the context of the recent NBA clustering lab.

FilteredSelect - Finding the Grandpas of the NBA

colnames(salaries) <- c("Player", "Salary")

# uses SELECT to define the attributes we will demonstrate for final tibble and what columns we will have 

# chooses FROM nba2021

# Instead of using maybe a mutate pipe and filter function, we simply say: WHERE Age >= 32 

starters <- sqldf("SELECT Player, Pos, Age, Tm as Team, G, GS
                  FROM nba2021 WHERE Age >= 32")
datatable(starters)

FilteredCount - Teams’ Player Totals

# selects the Tm column 

# uses "as" to rename the Tm variable as "Team" for eventual DataTable

# COUNT is used here to count up instances of different string abbreviations for teams 

# WHERE is once again used here to filter; this time excluding TOT values for the Team Variable 

# Finds total number of players for each team and groups by team name 
num.players <- sqldf("SELECT Tm as Team, COUNT(Tm) as 'Total Players'
                     FROM nba2021
                     WHERE Tm != 'TOT'
                     GROUP BY Tm")
datatable(num.players)

Select/GroupBy Queries - Average Salary of Each Team for the 2020-2021 Seasons

#Uses SELECT to pick columns of Tm & Salary 

# Uses ROUND and AVG to find mean Salary for each Tm from a

# LEFT JOIN Initializes the joining of the tables and ON indicates where the Join should happen (where a.Player = b.Player)

# after all this, it ends up finding the average salary of each team for the 2020-2021

team.salary <- sqldf("SELECT a.Tm as Team, ROUND(AVG(b.Salary), 2) as 'Mean Salary ($)'
                     FROM nba2021 a
                     LEFT JOIN salaries b ON a.Player = b.Player
                     GROUP BY a.Tm
                     HAVING a.Tm != 'TOT'
                     ")
datatable(team.salary)

Cross Join - The Marginal Cost of Field Goals for Each Team in terms of Salary

# Rounds the sum of the salary column divided by the sum of the points column

# CROSS JOIN salaries as "b" and nba2021 databases on Player column

# Orders in terms of cost per point column values in Ascending order using ORDER BY and ASC


crossjoin <- sqldf("Select a.Tm, ROUND((SUM(b.Salary)/ SUM(a.Pts)), 2) as 'Cost per Point' 
      FROM nba2021 a
      CROSS JOIN salaries b ON a.Player = b.Player
      GROUP BY a.Tm 
      HAVING a.TM != 'TOT'
      ORDER BY ROUND((SUM(b.Salary) / SUM(a.Pts)), 2) ASC")
datatable(crossjoin)

Similar Packages

Reticulate. Reticulate provides a set of tools that allow for calling Python from R in Rmarkdown. It allows for sourcing Python scripts, Python Modules, and translation between R and Python Objects. This can be particularly useful for teams who are collaborating within R but do not all use R…

Similarities to sqldf:

  • allows for the convenient implementation of a non-native language within the R environment

Differences to sqldf:

  • Reticulate can allow for an interactive Python console within R. Objects you create within Python are available to your R session, which is not necessarily guaranteed with the use of the sqldf package alone.

  • When the Reticulate package is accessed, Output within R can be presented in python format, whereas the sqldf package only allows a user to implement SQL formatted commands for output that is still styled in the standard RStudio format.

Rcpp. Rcpp provides efficient integration of C++ and R by providing its own API over R’s C API. This can allow for many benefits for the C++ programmer using R, such as the more efficient use of recursive functions and the use of advanced algorithms that R cannot provide.

Similarities to sqldf:

  • allows for the convenient implementation of a non-native language within the R environment

Differences to sqldf:

  • Rather than literally typing SQL commands into the R environment as sqldf does, Rccp will compile C++ code and construct R functions that connects to the compiled C++ function in question.

  • It does not require the use of assignment to create functions.

Reflection

Overall, sqldf is a very convenient and effective package that can be used in a lot of ways. Using the sqldf() function, we can tackle both data preperation and data analysis, making this package very versatile. However, all things must have benefits and drawbacks associated with them. After our own experience with the package, as well as further research in the community of sqldf users, we compiled lists of pros and cons associated with sqldf. The pros in this situation are plentiful, as the package is quite convenient and easy to use. First, sqldf() can manipulate data sets at times be even faster than the corresponding pure R calculation. The process in running sqldf() is that it creates a temporary database to store and manipulate the data frame being studied, then returns the new data frame and removes that temporary database. In addition, it has been suggested that aggregations over highly granular columns using sqldf is faster than any other alternative tried. From this, we see that the package function is highly efficient. Another pro for using sqldf is that it can be used as a bridge between SQL and R. Those who do not know SQL, but know R, can use this package to learn the foundations of SQL statements. Furthermore, those who don’t know R, but know SQL, can use their SQL information and pair it with R commands and functions.

That being said, the package is not perfect. There are problems that some users experience, as well as some odd features. First, the sqldf package uses the gsubfn package for parsing and the gsubfn package optionally uses the tcltk R package which in turn uses string processing language, tcl, internally. A lot of times, this can cause problems for those who aren’t using recent versions of R, so all that has to be done is make sure your R is updated. A second complaint is that SQL is case insensitive. R on the other hand, is case sensitive, so the difference in sensitivity can be a nuissance at times. But, using the functions within this package also make R coding case insensitive. Therefore, one must always be cognisant of this to avoid mistakes and errors running code under this package. Lastly, although data frames referenced in the SQL statement(s) passed to sqldf are automatically imported to SQLite, sqldf does not automatically export anything for safety reasons. Users who are constantly updating tables and datasets have expressed this as a point of complaint, as there is a specific way this must be done under this package.