library(openintro)
library(tinytex)
library(tidyverse)

SQL and R - An Example

There are a number of ways to use SQL directly in R. SQL is extremely useful for manipulating data, especially when pulling data from multiple related tables. What follows is an example of how R may be used to load tables from csv files into a SQL Server database, and how SQL queries can then be run from within R. We will use the RODBC package,

Creating the Connection

[This update to the document was made on 2/14/20, before homework is due but after submission. This code relies on ODBC being installed on the host computer. ODBC is already installed on Windows. However, I was not aware that it may not be if you have a mac. Microsoft provides this documentation for how to install ODBC to connect to SQL Server on a mac: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/install-microsoft-odbc-driver-sql-server-macos?view=sql-server-ver15]

We are connecting to a SQL database on Azure called EHtmp.The database will consist of two tables: Friends, a parent table with the name, age and gender of some friends and MovieRatings, a child table containing movie names and friend ratings on a scale of 1 to 5. Here we create the connection with RODBC.

This is the code - however, I am hiding the block so that the password doesn’t appear:

# install.packages (“RODBC”) - install RODBC if not already installed

library(RODBC)

dbConnection <- odbcDriverConnect(‘Driver={ODBC Driver 13 for SQL Server}; Server=[server]; Database=[db]; Uid=[uid]; Pwd=[pwd]; Encrypt=yes; TrustServerCertificate=no; Connection Timeout=30;’)

Now we load the table data from two CSV files that are on github.

dfFriends <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Movie_Friends.csv", header=TRUE)
dfMovieRatings <- read.csv("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Movies_MovieRatings.csv", header=TRUE)

Creating and Filling the Tables

Now we create and fill the two tables - Friends and MovieRatings. Because Friends is a parent table and MovieRatings a child table, MovieRatings contains FriendID as a foreign key.

RODBC makes it easy to create and populate the tables in the DB with the sqlSave command. We don’t even need to use the CREATE TABLE command. (BTW, if we needed to create the table ourselves we can still do it - this command would do the trick:

          sqlQuery(dbConnection,"
                CREATE TABLE Friends
                (
                    FriendID int,
                    LastName nvarchar(30),
                    FirstName nvarchar(30),
                    Age int,
                    Gender nvarchar(30),
                  CONSTRAINT PK_Friends PRIMARY KEY
                  (FriendID ASC)
                )")
FriendsTable <- (sqlQuery(dbConnection, "DROP TABLE IF EXISTS Friends"))
DropMovieRatingsTable <- (sqlQuery(dbConnection, "DROP TABLE IF EXISTS MovieRatings"))
  
PopulateFriendsTable <- (sqlSave(dbConnection, dfFriends, "Friends", append=TRUE))
PopulateFriendsTable
## [1] 1
PopulateMovieRatingsTable <- (sqlSave(dbConnection, dfMovieRatings, "MovieRatings", append=TRUE))
PopulateMovieRatingsTable
## [1] 1
ShowFriendsTable <- (sqlQuery(dbConnection, "SELECT * FROM Friends" ))
ShowMovieRatingsTable <- (sqlQuery(dbConnection, "SELECT * FROM MovieRatings" ))

ShowFriendsTable 
##    rownames FriendID       LastName FirstName Age Gender
## 1         1        1          Jones  Jennifer  53 Female
## 2         2        2         Lursch      Ruby  13 Female
## 3         3        3         Fitten      Eric  60   Male
## 4         4        4          Jones      Fred  50   Male
## 5         5        5        Johnson    Chucky  15   Male
## 6         6        6           Vexx    Kragen  56   Male
## 7         7        7  Van Der Groot    Gerald  80   Male
## 8         8        8          Lopez    Egland  47 Female
## 9         9        9          Quail    Carson  45 Female
## 10       10       10         Quemba     Royal  16   Male
ShowMovieRatingsTable 
##    rownames MovieID FriendID Overall_Rating              MovieName
## 1         1       1        3              5 Avengers Age of Ultron
## 2         2       2        3              5      The Fifth Element
## 3         3       3        3              5              Afterlife
## 4         4       4        3              4             Departures
## 5         5       5        4              5               MegaMind
## 6         6       6        4              1              Lady Bird
## 7         7       7        4              5          Maze Runner 1
## 8         8       8        4              2                Arrival
## 9         9       9        4              5 Avengers Age of Ultron
## 10       10      10        4              5      The Fifth Element
## 11       11      11        5              3               MegaMind
## 12       12      12        5              3              Lady Bird
## 13       13      13        5              5          Maze Runner 1
## 14       14      14        5              5 Avengers Age of Ultron
## 15       15      15        5              4      The Fifth Element
## 16       16      16        6              4               MegaMind
## 17       17      17        6              5              Lady Bird
## 18       18      18        6              2          Maze Runner 1
## 19       19      19        6              5                Arrival
## 20       20      20        6              1         Romeo + Juliet
## 21       21      21        6              1       Moonrise Kingdom
## 22       22      22        6              2              Afterlife
## 23       23      23        6              2             Departures
## 24       24      24        7              4              Afterlife
## 25       25      25        7              4             Departures
## 26       26      26        7              3                Arrival
## 27       27      27        7              3       Moonrise Kingdom
## 28       28      28        8              3              Lady Bird
## 29       29      29        8              3                Arrival
## 30       30      30        8              5         Romeo + Juliet
## 31       31      31        8              4       Moonrise Kingdom
## 32       32      32        8              5              Afterlife
## 33       33      33        9              5              Lady Bird
## 34       34      34        9              5          Maze Runner 1
## 35       35      35        9              2       Moonrise Kingdom
## 36       36      36        9              5 Avengers Age of Ultron
## 37       37      37        9              1      The Fifth Element
## 38       38      38       10              5               MegaMind
## 39       39      39       10              1              Lady Bird
## 40       40      40       10              5          Maze Runner 1
## 41       41      41        1              2          Maze Runner 1
## 42       42      42        1              5       Moonrise Kingdom
## 43       43      43        1              5         Romeo + Juliet
## 44       44      44        2              5               MegaMind
## 45       45      45        2              5              Lady Bird
## 46       46      46        2              2          Maze Runner 1
## 47       47      47        2              2 Avengers Age of Ultron
## 48       48      48        2              2      The Fifth Element
## 49       49      49        2              5         Romeo + Juliet
## 50       50      50        2              5             Departures

Handling Null Values

Because our tables are normalized, the fact that not all friends rated all movies is not a problem. For example, here is a query which calculates the average age of those who rated each movie, along with the number of raters for the movie:

MovieRatingstats <- (sqlQuery(dbConnection, "SELECT MovieRatings.MovieName as Movie_Name, Avg(Friends.Age) AS Avg_Age_of_Raters, Count(MovieRatings.MovieID) AS N
FROM MovieRatings INNER JOIN Friends ON MovieRatings.FriendID = Friends.FriendID
GROUP BY MovieRatings.MovieName
ORDER BY Avg(Friends.Age);"
))
MovieRatingstats
##                Movie_Name Avg_Age_of_Raters N
## 1                MegaMind                30 5
## 2               Lady Bird                34 7
## 3           Maze Runner 1                35 7
## 4  Avengers Age of Ultron                36 5
## 5       The Fifth Element                36 5
## 6          Romeo + Juliet                42 4
## 7              Departures                52 4
## 8        Moonrise Kingdom                56 5
## 9                 Arrival                58 4
## 10              Afterlife                60 4

However, null values will appear if we create a crosstab of friends and movies. Here we add a table to the DB based on a crosstab which features three movies - the missing ratings appear as an “NA”:

dfCrosstab <- as.data.frame(sqlQuery(dbConnection, "SELECT FriendID, [Arrival] AS Arrival, [Afterlife] AS Afterlife, [Departures] AS Departures
FROM 
   ( SELECT FriendID, MovieName, Overall_Rating
     FROM MovieRatings 
   ) ps
PIVOT
   ( SUM (Overall_Rating)
     FOR MovieName IN ( [Arrival], [Afterlife], [Departures])
   ) AS pvt;"))
dfCrosstab
##    FriendID Arrival Afterlife Departures
## 1         1      NA        NA         NA
## 2         2      NA        NA          5
## 3         3      NA         5          4
## 4         4       2        NA         NA
## 5         5      NA        NA         NA
## 6         6       5         2          2
## 7         7       3         4          4
## 8         8       3         5         NA
## 9         9      NA        NA         NA
## 10       10      NA        NA         NA
DropCrosstabTable <- (sqlQuery(dbConnection, "DROP TABLE IF EXISTS Crosstab"))
  
PopulateCrosstabTable <- (sqlSave(dbConnection, dfCrosstab, "Crosstab", append=TRUE))

If we want to get an average rating for Arrival, what can we do about the nulls? In some cases we would want to extrapolate or infer the data. In this case, it’s easy enough to filter the data out:

AverageRatingForArrival <- (sqlQuery(dbConnection, "SELECT Avg(Arrival) AS Avg_Rating
FROM Crosstab
WHERE Arrival is not null;"
))

AverageRatingForArrival
##   Avg_Rating
## 1          3

Running SQL Queries Within R

SQL allows us to look at the data in more complex ways. For example, this query looks at rating by gender:

dfMovieRatingsByGender <- as.data.frame((sqlQuery(dbConnection, "SELECT MovieRatings.MovieName, Friends.Gender, Count(MovieRatings.MovieID) AS CountOfMovieID, Avg(MovieRatings.OverAll_Rating) AS AvgOfOverall_Rating
FROM MovieRatings INNER JOIN Friends ON MovieRatings.FriendID = Friends.FriendID
GROUP BY MovieRatings.MovieName, Friends.Gender
ORDER BY MovieRatings.MovieName, Friends.Gender;
"
)))
dfMovieRatingsByGender
##                 MovieName Gender CountOfMovieID AvgOfOverall_Rating
## 1               Afterlife Female              1                   5
## 2               Afterlife   Male              3                   3
## 3                 Arrival Female              1                   3
## 4                 Arrival   Male              3                   3
## 5  Avengers Age of Ultron Female              2                   3
## 6  Avengers Age of Ultron   Male              3                   5
## 7              Departures Female              1                   5
## 8              Departures   Male              3                   3
## 9               Lady Bird Female              3                   4
## 10              Lady Bird   Male              4                   2
## 11          Maze Runner 1 Female              3                   3
## 12          Maze Runner 1   Male              4                   4
## 13               MegaMind Female              1                   5
## 14               MegaMind   Male              4                   4
## 15       Moonrise Kingdom Female              3                   3
## 16       Moonrise Kingdom   Male              2                   2
## 17         Romeo + Juliet Female              3                   5
## 18         Romeo + Juliet   Male              1                   1
## 19      The Fifth Element Female              2                   1
## 20      The Fifth Element   Male              3                   4
g <- ggplot(data = dfMovieRatingsByGender, aes(x = AvgOfOverall_Rating, y= MovieName, group=Gender, fill=Gender)) +
  geom_col(position = position_dodge()) +
  xlab("Average Rating") + ylab("Movie") +
  ggtitle("Average Movie Ratings By Gender")
g

I wasn’t surprised to see some of the more romantic movies favored by women (e.g. Romeo + Juliet) and more superhero movies favored by men (e.g. Age of Ultron), If I have a recommendation, it would be Departures, a beautiful Korean movie about a cellist-turned-undertaker. It had a high average rating but few rated it so it isn’t generally seen.

Conclusion

RODBC makes connecting and running queries on an ODBC-connectible database easy and straightforward. There is much more that can be done but this can get you started.