Using R and SQL to Analyze My Favorite Movies

Overview

This R Markdown will explain how to connect to a MySQL Database, create a table, and visualize the data.

Data Collection

For this exercise, 12 survey recipients were contacted to rate six popular movies. The respondents were given instructions to rate the movies with five as the highest rating, and to skip movies they had not seen.
In the code below, the RStudio API package is used to generate a pop-up message requesting the username and password for the MySQL database. However, you can also use the .REnviron to save the respective username and password.

username <- rstudioapi::askForPassword("Database username")
password <- rstudioapi::askForPassword("Database password")

The survey responses were collected in Google Forms, and were then exported to a .csv file. In the code below, the survey responses are read into R, and then used to create a MySQL table, with the connection established in the step above.
To better understand the dataset, the ‘DESCRIBE’ sql function can be used, additionally, the head() and glimpse() functions can be used to get a quick understanding of the dataset.

movieRatingData <- read.csv(file="Movie_Rating_Responses.csv", 
                            header=TRUE, 
                            sep=",")

dbWriteTable(mydb, 
             value = movieRatingData, 
             row.names = FALSE, 
             name = "movie_ratings", 
             append = TRUE )

## [1] TRUE

sql_table <- dbReadTable(mydb, "movie_ratings")

dbSendQuery(mydb, "DESCRIBE movie_ratings") |>  
  fetch()

##                  Field            Type Null Key Default
## 1            my_row_id bigint unsigned   NO PRI    <NA>
## 2            Timestamp            text  YES        <NA>
## 3 Shawshank.Redemption          bigint  YES        <NA>
## 4        The.Lion.King          bigint  YES        <NA>
## 5        Step.Brothers          bigint  YES        <NA>
## 6         Pulp.Fiction          bigint  YES        <NA>
## 7           Fight.Club          bigint  YES        <NA>
##                      Extra
## 1 auto_increment INVISIBLE
## 2                         
## 3                         
## 4                         
## 5                         
## 6                         
## 7

head(sql_table)

##           Timestamp Shawshank.Redemption The.Lion.King Step.Brothers
## 1 2/9/2025 14:37:08                    5             5             5
## 2 2/9/2025 14:39:50                    4             5             4
## 3 2/9/2025 14:41:08                    5             4             3
## 4 2/9/2025 14:47:06                    5             5             4
## 5 2/9/2025 14:50:52                    5             4             5
## 6 2/9/2025 14:52:14                    5             4             3
##   Pulp.Fiction Fight.Club
## 1            5          5
## 2           NA          1
## 3            4          4
## 4            5          5
## 5            1          1
## 6            4          3

glimpse(sql_table)

## Rows: 77
## Columns: 6
## $ Timestamp            <chr> "2/9/2025 14:37:08", "2/9/2025 14:39:50", "2/9/20…
## $ Shawshank.Redemption <dbl> 5, 4, 5, 5, 5, 5, 4, 1, 5, NA, NA, 5, 4, 5, 5, 5,…
## $ The.Lion.King        <dbl> 5, 5, 4, 5, 4, 4, 5, 2, 4, 4, NA, 5, 5, 4, 5, 4, …
## $ Step.Brothers        <dbl> 5, 4, 3, 4, 5, 3, 4, 5, 4, 5, 5, 5, 4, 3, 4, 5, 3…
## $ Pulp.Fiction         <dbl> 5, NA, 4, 5, 1, 4, 5, 4, 2, 5, 1, 5, NA, 4, 5, 1,…
## $ Fight.Club           <dbl> 5, 1, 4, 5, 1, 3, 5, 3, 2, NA, NA, 5, 1, 4, 5, 1,…

Then, to visualize the data, first it is helpful to know what the column names are in the MySQL table, “movie_ratings”. The column names are used to pivot the dataframe longer, so we can create a bar graph depicting the received responses.

Using the ‘ggplot2’ package, bar charts for each of the movies are plotted, in one object. This can be accomplished by using the ‘facet_wrap()’ argument.

colnames(sql_table)

## [1] "Timestamp"            "Shawshank.Redemption" "The.Lion.King"       
## [4] "Step.Brothers"        "Pulp.Fiction"         "Fight.Club"

movieRatingData_v2 <- sql_table |> 
  pivot_longer( cols = c("Shawshank.Redemption",
                         "The.Lion.King",
                         "Step.Brothers",
                         "Pulp.Fiction",
                         "Fight.Club"),
                names_to = "Movies", values_to = "Rating") |> 
  group_by(Movies,Rating) |> 
  summarize(Count=n()) |> 
  mutate(Movies=str_replace_all(Movies,"\\."," ")) 

ggplot(movieRatingData_v2)+
  geom_bar(aes(x=Rating,y=Count, fill=Movies),stat="identity")+
  scale_fill_brewer(palette = "Pastel1")+
  facet_wrap(~Movies)+
  theme(legend.position="none",
        strip.background = element_rect(
     color="grey", fill="white", size=.75, linetype="solid"))

Note about missing data

Survey respondents were instructed to leave responses blank if they had not seen a movie. This meant that for some movies, there are missing data. For this exercise, missing data was ignored. If more data about the respondents was collected, for example, demographic data, estimates for missing data may have been included through imputations.

Using R and SQL to Analyze My Favorite Movies

Alyssa Gurkas

2025-02-09

Overview

Data Collection

Note about missing data