Overview

The sqldf package provides an easy way to execute Series Query Language (SQL) statements in R. According to the sqldf github page, the package is optimized for convenience, implying that it is intended for people with prior knowledge of SQL.

Structured Query Language, or SQL, is a language developed by IBM used to access and manipulate databases. Many of the world’s most prominent and important database systems use SQL in addition to their own proprietary relational database management systems (RDBMS). Before SQL, people would have to use read-write Application Programming Interfaces to access digital records, but with SQL, you can access multiple records with minimal commands and simple syntax. Common applications of SQL commands include executing queries, retrieving data, inserting, updating, and deleting records, and creating databases. SQL is not designed for statistical analysis, it is simply for data management.

Usage

The only function included in the sqldf package is sqldf(). Although the function is optimized for convenience, in some cases the sqldf function “can be faster than the corresponding pure R calculation.” The function takes R dataframe(s) in place of standard SQL table name(s) and automatically creates an appropriate database. sqldf() executes the specified SQL commands upon said database and returns an R dataframe. The database sqldf() created is deleted making the entire process “transparent” to the user.

While there is only one function included in the sqldf package, there are many SQL keywords and functions available to the user through the package.

  • SELECT: the primary function sqldf refers to. Specifies the columns in the database that the user wants to query
  • FROM: specifies which table to select or delete data from
  • WHERE: filters results to include only records that satisfy the given condition
  • DELETE: deletes rows from a table
  • DROP: deletes a column, constraint, database, index, table, or view
  • GROUP BY: groups rows that have the same values into summary rows, frequently used with aggregate functions, such as COUNT(), MAX(), MIN(), SUM(), and AVG().
  • ORDER BY: used to sort the result-set in ascending or descending order, ascending is default so use the DESC keyword to sort in descending order
  • CASE: similar to a series of if, else if, and else statements in R. Case statements go through a set of conditions and execute the relevant code once a condition is met.
    • WHEN: part of a CASE statement, the explicit condition to be met
    • THEN: follows a WHEN statement, the code to execute if the condition is met
    • ELSE: what to do if no condition is met, if there is no ELSE statement and no condition is met, the series will return NULL
    • END: signifies the end of a CASE statement
  • JOIN: there are many different methods to join SQL tables together, here are two common examples (note: sqldf does not support right join, the equivalent is to swap the table order and use LEFT JOIN)
    • INNER JOIN: selects records that have matches in both tables and merges specified together, must specify which column to merge on in each table
    • LEFT JOIN: The result of LEFT JOIN shall be the same as the result of INNER JOIN + we’ll have rows, from the “left” table, without a pair in the “right” table

Package Dependencies

  • Gsubfn: The gsubfn function is like gsub but can take a replacement function or certain other objects instead of the replacement string. Gsubfn can also be used to split strings on content rather than requiring a delimiter. The function has the ability to translate formulas to functions and allows formulas in function calls instead of functions. This is especially useful in SQL calls that involve strings where substitution is desirable.
  • Proto: An object oriented system using object-based, or prototype-based, object oriented ideas. The one function, ‘proto’ creates or modifies objects of the proto system.
  • RSQLite: RSQLite package will integrate SQLite into R running in RStudio. SQLite is a public-domain, single-user, very light-weight database engine that implements a subset of the SQL standard, including the core table creation, updating, insertion, and selection operations, plus transaction management.

Usage Examples: NBA Breakdown

In order to best demonstrate the functionality of sqldf and SQL in general, below are a few operations querying and manipulating data from the recent NBA clustering lab.

Initial Datatables

Player Stats

Player Salaries

Only Starters

# FIND ALL PLAYERS WHO HAVE STARTED EVERY GAME THEY HAVE PLAYED IN.

# SELECT player name, position, age, team, games played, and games started
# FROM the "stats" tibble
# WHERE games played, G, is equal to games started, GS
starters <- sqldf("SELECT Player, Pos, Age, Tm as Team, G, GS
                  FROM stats WHERE G == GS")

datatable(starters)

Teams’ Player Totals

# FIND THE TOTAL NUMBER OF PLAYERS WHO PLAYED FOR 
# EACH TEAM IN THE 2020-2021 SEASON

# SELECT the "Tm" column and name it as "Team."
# COUNT the occurrences of each team in Tm, set name to "Total Players."
# FROM the stats database
# WHERE Tm is not equal to "TOT." TOT is included for players
# who played for more than one team (the aggregation of the player's stats)
# GROUP BY team name
num.players <- sqldf("SELECT Tm as Team, COUNT(Tm) as 'Total Players'
                     FROM stats
                     WHERE Tm != 'TOT'
                     GROUP BY Tm")

datatable(num.players)

Every Team’s Average Salary

# FIND THE AVERAGE SALARY OF EACH TEAM FOR THE SEASON

# SELECT the relevant columns Tm and Salary
# ROUND the AVG salary
# FROM stats which is denoted by "a"
# LEFT JOIN salaries, denoted by "b", to stats on Player name
# GROUP BY a.Tm
# HAVING a.Tm not equal to "TOT," which is the stat aggregate for traded players
team.salary <- sqldf("SELECT a.Tm as Team, ROUND(AVG(b.Salary), 2) as 'Mean Salary ($)'
                     FROM stats a
                     LEFT JOIN salaries b ON a.Player = b.Player
                     GROUP BY a.Tm
                     HAVING a.Tm != 'TOT'
                     ")

datatable(team.salary)

Teams’ Cost per Point

# FIND THE DOLLAR COST OF EACH POINT SCORED BY AN NBA TEAM

# SELECT Tm, Salary, and Pts columns
# ROUND the sum of the salary column divided by the sum of the points column
# FROM stats database set as "a"
# INNER JOIN salaries as "b" and stats databases on Player column
# GROUP BY Tm
# HAVING each Tm not equal to 'TOT'
# ORDER BY cost per point column
# DESC descending order
point.cost <- sqldf("SELECT a.Tm, ROUND((SUM(b.Salary) / SUM(a.Pts)), 2) as 'Cost per Point'
                    FROM stats a
                    INNER JOIN salaries b ON a.Player = b.Player
                    GROUP BY a.Tm
                    HAVING a.Tm != 'TOT'
                    ORDER BY ROUND((SUM(b.Salary) / SUM(a.Pts)), 2) DESC
                    ")
datatable(point.cost)

Similar Pacakgees

With the scope of R packages that are in existence, there must be some that are similar to ours. After some research, we were able to find some packages that showed at least some resemblance to sqldf. The first is Reticulate. Reticulate provides a set of tools that allow for calling Python from R in Rmarkdown. It allows for sourcing Python scripts, Python Modules, and translation between R and Python Objects. This package would be particularly useful for teams who are collaborating within R but do not all use R. This is similar to sqldf in that it allows for the convenient implementation of a non-native language within the R environment. In the same way that sqldf allows for R users to implement SQL, Reticulate allows R users to impleement Python code. However, Reticulate differs itself from sqldf, as Reticulate can allow for an interactive Python console within R. Objects you create within Python are available to your R session, which is not necessarily guaranteed with the use of the sqldf package alone. Also, when the Reticulate package is accessed, output within R can be presented in python format, whereas the sqldf package only allows a user to implement SQL formatted commands for output that is still styled in the standard RStudio format.

A second package we found to be similar is Rcpp. provides efficient integration of C++ and R by providing its own API over R’s C API. This can allow for many benefits for the C++ programmer using R, such as the more efficient use of recursive functions and the use of advanced algorithms that R cannot provide. Just like as in sqldf and in Reticulate, Rcpp allows for the convenient implementation of a non-native language within the R environment. Rcpp utilizes C++ code in the same manner that sqldf utilizes SQL code. However, we can differentiate the two packages, as Rccp will compile C++ code and construct R functions that connects to the compiled C++ function in question, rather than literally typing SQL commands into the R environment as sqldf does. In addition, Rcpp does not require the use of assignment to create functions.

Reflection

Overall, sqldf is a very convenient and effective package that can be used in a lot of ways. Using the sqldf() function, we can tackle both data preperation and data analysis, making this package very versatile. However, all things must have benefits and drawbacks associated with them. After our own experience with the package, as well as further research in the community of sqldf users, we compiled lists of pros and cons associated with sqldf. The pros in this situation are plentiful, as the package is quite convenient and easy to use. First, sqldf() can manipulate data sets at times be even faster than the corresponding pure R calculation. The process in running sqldf() is that it creates a temporary database to store and manipulate the data frame being studied, then returns the new data frame and removes that temporary database. In addition, it has been suggested that aggregations over highly granular columns using sqldf is faster than any other alternative tried. From this, we see that the package function is highly efficient. Another pro for using sqldf is that it can be used as a bridge between SQL and R. Those who do not know SQL, but know R, can use this package to learn the foundations of SQL statements. Furthermore, those who don’t know R, but know SQL, can use their SQL information and pair it with R commands and functions.

That being said, the package is not perfect. There are problems that some users experience, as well as some odd features. First, the sqldf package uses the gsubfn package for parsing and the gsubfn package optionally uses the tcltk R package which in turn uses string processing language, tcl, internally. A lot of times, this can cause problems for those who aren’t using recent versions of R, so all that has to be done is make sure your R is updated. A second complaint is that SQL is case insensitive. R on the other hand, is case sensitive, so the difference in sensitivity can be a nuissance at times. But, using the functions within this package also make R coding case insensitive. Therefore, one must always be cognisant of this to avoid mistakes and errors running code under this package. Lastly, although data frames referenced in the SQL statement(s) passed to sqldf are automatically imported to SQLite, sqldf does not automatically export anything for safety reasons. Users who are constantly updating tables and datasets have expressed this as a point of complaint, as there is a specific way this must be done under this package.