The sqldf package provides an easy way to execute Series Query Language (SQL) statements in R. According to the sqldf github page, the package is optimized for convenience, implying that it is intended for people with prior knowledge of SQL.
Structured Query Language, or SQL, is a language developed by IBM used to access and manipulate databases. Many of the world’s most prominent and important database systems use SQL in addition to their own proprietary relational database management systems (RDBMS). Before SQL, people would have to use read-write Application Programming Interfaces to access digital records, but with SQL, you can access multiple records with minimal commands and simple syntax. Common applications of SQL commands include executing queries, retrieving data, inserting, updating, and deleting records, and creating databases. SQL is not designed for statistical analysis, it is simply for data management.
The only function included in the sqldf package is sqldf(). Although the function is optimized for convenience, in some cases the sqldf function “can be faster than the corresponding pure R calculation.” The function takes R dataframe(s) in place of standard SQL table name(s) and automatically creates an appropriate database. sqldf() executes the specified SQL commands upon said database and returns an R dataframe. The database sqldf() created is deleted making the entire process “transparent” to the user.
While there is only one function included in the sqldf package, there are many SQL keywords and functions available to the user through the package.
In order to best demonstrate the functionality of sqldf and SQL in general, below are a few operations querying and manipulating data from the recent NBA clustering lab.
# FIND ALL PLAYERS WHO HAVE STARTED EVERY GAME THEY HAVE PLAYED IN.
# SELECT player name, position, age, team, games played, and games started
# FROM the "stats" tibble
# WHERE games played, G, is equal to games started, GS
starters <- sqldf("SELECT Player, Pos, Age, Tm as Team, G, GS
FROM stats WHERE G == GS")
datatable(starters)
# FIND THE TOTAL NUMBER OF PLAYERS WHO PLAYED FOR
# EACH TEAM IN THE 2020-2021 SEASON
# SELECT the "Tm" column and name it as "Team."
# COUNT the occurrences of each team in Tm, set name to "Total Players."
# FROM the stats database
# WHERE Tm is not equal to "TOT." TOT is included for players
# who played for more than one team (the aggregation of the player's stats)
# GROUP BY team name
num.players <- sqldf("SELECT Tm as Team, COUNT(Tm) as 'Total Players'
FROM stats
WHERE Tm != 'TOT'
GROUP BY Tm")
datatable(num.players)
# FIND THE AVERAGE SALARY OF EACH TEAM FOR THE SEASON
# SELECT the relevant columns Tm and Salary
# ROUND the AVG salary
# FROM stats which is denoted by "a"
# LEFT JOIN salaries, denoted by "b", to stats on Player name
# GROUP BY a.Tm
# HAVING a.Tm not equal to "TOT," which is the stat aggregate for traded players
team.salary <- sqldf("SELECT a.Tm as Team, ROUND(AVG(b.Salary), 2) as 'Mean Salary ($)'
FROM stats a
LEFT JOIN salaries b ON a.Player = b.Player
GROUP BY a.Tm
HAVING a.Tm != 'TOT'
")
datatable(team.salary)
# FIND THE DOLLAR COST OF EACH POINT SCORED BY AN NBA TEAM
# SELECT Tm, Salary, and Pts columns
# ROUND the sum of the salary column divided by the sum of the points column
# FROM stats database set as "a"
# INNER JOIN salaries as "b" and stats databases on Player column
# GROUP BY Tm
# HAVING each Tm not equal to 'TOT'
# ORDER BY cost per point column
# DESC descending order
point.cost <- sqldf("SELECT a.Tm, ROUND((SUM(b.Salary) / SUM(a.Pts)), 2) as 'Cost per Point'
FROM stats a
INNER JOIN salaries b ON a.Player = b.Player
GROUP BY a.Tm
HAVING a.Tm != 'TOT'
ORDER BY ROUND((SUM(b.Salary) / SUM(a.Pts)), 2) DESC
")
datatable(point.cost)
With the scope of R packages that are in existence, there must be some that are similar to ours. After some research, we were able to find some packages that showed at least some resemblance to sqldf. The first is Reticulate. Reticulate provides a set of tools that allow for calling Python from R in Rmarkdown. It allows for sourcing Python scripts, Python Modules, and translation between R and Python Objects. This package would be particularly useful for teams who are collaborating within R but do not all use R. This is similar to sqldf in that it allows for the convenient implementation of a non-native language within the R environment. In the same way that sqldf allows for R users to implement SQL, Reticulate allows R users to impleement Python code. However, Reticulate differs itself from sqldf, as Reticulate can allow for an interactive Python console within R. Objects you create within Python are available to your R session, which is not necessarily guaranteed with the use of the sqldf package alone. Also, when the Reticulate package is accessed, output within R can be presented in python format, whereas the sqldf package only allows a user to implement SQL formatted commands for output that is still styled in the standard RStudio format.
A second package we found to be similar is Rcpp. provides efficient integration of C++ and R by providing its own API over R’s C API. This can allow for many benefits for the C++ programmer using R, such as the more efficient use of recursive functions and the use of advanced algorithms that R cannot provide. Just like as in sqldf and in Reticulate, Rcpp allows for the convenient implementation of a non-native language within the R environment. Rcpp utilizes C++ code in the same manner that sqldf utilizes SQL code. However, we can differentiate the two packages, as Rccp will compile C++ code and construct R functions that connects to the compiled C++ function in question, rather than literally typing SQL commands into the R environment as sqldf does. In addition, Rcpp does not require the use of assignment to create functions.
Overall, sqldf is a very convenient and effective package that can be used in a lot of ways. Using the sqldf() function, we can tackle both data preperation and data analysis, making this package very versatile. However, all things must have benefits and drawbacks associated with them. After our own experience with the package, as well as further research in the community of sqldf users, we compiled lists of pros and cons associated with sqldf. The pros in this situation are plentiful, as the package is quite convenient and easy to use. First, sqldf() can manipulate data sets at times be even faster than the corresponding pure R calculation. The process in running sqldf() is that it creates a temporary database to store and manipulate the data frame being studied, then returns the new data frame and removes that temporary database. In addition, it has been suggested that aggregations over highly granular columns using sqldf is faster than any other alternative tried. From this, we see that the package function is highly efficient. Another pro for using sqldf is that it can be used as a bridge between SQL and R. Those who do not know SQL, but know R, can use this package to learn the foundations of SQL statements. Furthermore, those who don’t know R, but know SQL, can use their SQL information and pair it with R commands and functions.
That being said, the package is not perfect. There are problems that some users experience, as well as some odd features. First, the sqldf package uses the gsubfn package for parsing and the gsubfn package optionally uses the tcltk R package which in turn uses string processing language, tcl, internally. A lot of times, this can cause problems for those who aren’t using recent versions of R, so all that has to be done is make sure your R is updated. A second complaint is that SQL is case insensitive. R on the other hand, is case sensitive, so the difference in sensitivity can be a nuissance at times. But, using the functions within this package also make R coding case insensitive. Therefore, one must always be cognisant of this to avoid mistakes and errors running code under this package. Lastly, although data frames referenced in the SQL statement(s) passed to sqldf are automatically imported to SQLite, sqldf does not automatically export anything for safety reasons. Users who are constantly updating tables and datasets have expressed this as a point of complaint, as there is a specific way this must be done under this package.