all images from imbd.com
This assignment demonstrates the process of loading information from a SQL database into an R dataframe. I chose six movies and asked friends on facebook to rate the movies on a scale of 1 to 5 (where 1=Strongly Dislikes, …, 3=Indifference, …, & 5=Strongly Liked the movies). The movies were recent & relatively popular movies that I had seen in the theater. It was difficult for me to chose movies, because I have rather eccentric taste and wanted to suggest titles that at least several of my friends had hopefully seen. There was one exception, however, I had not seen BlacKkKlansman. I just got too busy and never made it to the theater. Perhaps from surveying my friends on thier thoughts of the movie, I can make an estimate of how I would rate BlacKkKlansman?…..
The movies I chose were:
34 responses were received, however, of these, only 22 where considered for further analysis because ratings were given for at least half (>=3) of the movies.
With MySQL Workbench, the data was stored in a SQL database, ‘movie_ratings’, as a table, ‘FB_friend_ratings’.
The SQL script to load the data can be found here: DATA607: load_FB_friend_ratings.sql
Here we will look at two methods to load the data into RStudio:
First we load the necessary R libraries.
Next, we establish a connection to a MySQL database. The block of R code below shows the syntax for the connection. However, the actual line that is executed (and that contains my password) is hidden.
## <MySQLConnection:0,0>
Great!, we have established a connection and can now list the tables in the database ‘movie_ratings’
## [1] "FB_friend_ratings"
Now to create an R dataframe from the SQL table. We will do this by using a SELECT statement to select all the data in the table. If we were accessing a very large database, this would be an eccellent point to filter the data using a SELECT statement with more constraints.
#write our SELECT statement as a string
myQuery <- 'SELECT * FROM FB_friend_ratings;'
#apply out SELECT statement to the database we established a connection to.
ratings_df <- dbGetQuery( con, myQuery )
#display the head of the resulting R dataframe
head( ratings_df )## name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1 Adam 4 NA 4 4 5 5
## 2 Alex 3 4 2 4 5 NA
## 3 Ana 5 NA NA 5 NA 4
## 4 Ayanthi NA NA NA 4 5 NA
## 5 Ben 4 NA NA NA 4 3
## 6 Bonnie 4 5 4 NA 5 4
## [1] "name" "Black_Panther" "Parasite"
## [4] "Us" "BlacKkKlansman" "Get_Out"
## [7] "Star_Wars_Last_Jedi"
## [1] 22 7
We can see from the output of the above blocks, that the resulting dataframe has the expected features (columns) and dimensions.
MySQL Workbench provides some very user friendly features to export a data table to a .csv file. Through directions to do so are given in this MySQLTutorial The following code walks through the process of accessing the same data from the ‘FB_friend_ratings’ table that has been previously exported as a .csv file and uploaded to githut
#library(RCurl)
#create a string that holds the URL for the raw csv file
ratings_URL <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/FB_friend_ratings.csv'
gitRatings <- read.csv( url( ratings_URL ) )
head( gitRatings )## name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1 Adam 4 NA 4 4 5 5
## 2 Alex 3 4 2 4 5 NA
## 3 Ana 5 NA NA 5 NA 4
## 4 Ayanthi NA NA NA 4 5 NA
## 5 Ben 4 NA NA NA 4 3
## 6 Bonnie 4 5 4 NA 5 4
## [1] "name" "Black_Panther" "Parasite"
## [4] "Us" "BlacKkKlansman" "Get_Out"
## [7] "Star_Wars_Last_Jedi"
## [1] 22 7
We just generated two R dataframes using different methods; let’s test to see if they are equivalent. We will do this using the comparedf() function from the arsenal library
## Loading required package: arsenal
##
##
## Table: Summary of data.frames
##
## version arg ncol nrow
## -------- ----------- ----- -----
## x gitRatings 7 22
## y ratings_df 7 22
##
##
##
## Table: Summary of overall comparison
##
## statistic value
## ------------------------------------------------------------ ------
## Number of by-variables 0
## Number of non-by variables in common 7
## Number of variables compared 6
## Number of variables in x but not y 0
## Number of variables in y but not x 0
## Number of variables compared with some values unequal 0
## Number of variables compared with all values equal 6
## Number of observations in common 22
## Number of observations in x but not y 0
## Number of observations in y but not x 0
## Number of observations with some compared variables unequal 0
## Number of observations with all compared variables equal 22
## Number of values unequal 0
##
##
##
## Table: Variables not shared
##
## | |
## |:-----------------------|
## |No variables not shared |
##
##
##
## Table: Other variables not compared
##
## var.x pos.x class.x var.y pos.y class.y
## ------ ------ -------- ------ ------ ----------
## name 1 factor name 1 character
##
##
##
## Table: Observations not shared
##
## | |
## |:--------------------------|
## |No observations not shared |
##
##
##
## Table: Differences detected by variable
##
## var.x var.y n NAs
## -------------------- -------------------- --- ----
## Black_Panther Black_Panther 0 0
## Parasite Parasite 0 0
## Us Us 0 0
## BlacKkKlansman BlacKkKlansman 0 0
## Get_Out Get_Out 0 0
## Star_Wars_Last_Jedi Star_Wars_Last_Jedi 0 0
##
##
##
## Table: Differences detected
##
## | |
## |:-----------------------|
## |No differences detected |
##
##
##
## Table: Non-identical attributes
##
## var.x var.y name
## ------ ------ -------
## name name class
## name name levels
The output above definitively shows that the two methods we used to load the SQL data to dataframes in RStudio yielded identical data structures.
To avoid confusion, the rest of this demo will make use of ‘rating_df’ Use the head() function to preview the dataframe again
## name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1 Adam 4 NA 4 4 5 5
## 2 Alex 3 4 2 4 5 NA
## 3 Ana 5 NA NA 5 NA 4
## 4 Ayanthi NA NA NA 4 5 NA
## 5 Ben 4 NA NA NA 4 3
## 6 Bonnie 4 5 4 NA 5 4
From the output above, we can see that there are a lot of missing data entries given by ‘NA’. This indicates that a person gave no rating for the movie because they have not seen it. There are many ways to deal with missing data points. An excellent overview is given in this ‘Dealing with Missing Data using R’ article.
A simple way to deal with the missing data is imputation with the mean. Basically, any ‘NA’ in a given column is replaced by the column’s mean. We will execute this in the following code:
#ratings_df[,2:7]
ratingsMeans <- sapply(ratings_df[,2:7], function( cl)
list(means=mean(cl,na.rm=TRUE)))
ratingsMeans## $Black_Panther.means
## [1] 4
##
## $Parasite.means
## [1] 4.818182
##
## $Us.means
## [1] 4
##
## $BlacKkKlansman.means
## [1] 4.272727
##
## $Get_Out.means
## [1] 4.666667
##
## $Star_Wars_Last_Jedi.means
## [1] 3.588235
#write a function to replace any 'NA' element with the mean for a column
NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
#make a copy to modify
ratings_NA2Mean <- ratings_df
#use lapply to apply the 'NA2mean' function to each column
ratings_NA2Mean[] <- lapply(ratings_NA2Mean, NA2mean)
#preview to dataframe
head( ratings_NA2Mean )## name Black_Panther Parasite Us BlacKkKlansman Get_Out Star_Wars_Last_Jedi
## 1 Adam 4 4.818182 4 4.000000 5.000000 5.000000
## 2 Alex 3 4.000000 2 4.000000 5.000000 3.588235
## 3 Ana 5 4.818182 4 5.000000 4.666667 4.000000
## 4 Ayanthi 4 4.818182 4 4.000000 5.000000 3.588235
## 5 Ben 4 4.818182 4 4.272727 4.000000 3.000000
## 6 Bonnie 4 5.000000 4 4.272727 5.000000 4.000000
If we compare the output for head( ratings_df ) with the output for head( ratings_NA2Mean ), we observe that the ‘NA’ values in ratings_df have been replaced with the column appropriate values that are held in the variable ratingsMeans.
This resulting dataframe, ‘ratings_NA2Mean’, gives simple ratings estimates for each missing value. For instance, I had not seen the movie ‘BlacKkKlansman’. The mean rating from my friends is 4.27, so perhaps I should get around to watching it.
Imputation of missing data values with an estimate such as the mean is a simple and straight-forward way to deal with ‘NA’ values. There are certainly many much more sophisticated modeling and prediction techniques that can be used deal with missing values (e.g. K Nearest Neighbor ). However, they are typically best when deployed on much larger datasets & are out of the scope for this assignment.