Introduction

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends if necessary) to rate each of these movies that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database of your choosing. Load the information from the SQL database into an R dataframe.

Pior to created code, I created 2 CSV files and uploaded them into GitHub.
Additionally, I created and connected to a cloud-based database.

In the below code, I first captured the data on CSV files into dataframes. Followed by establishing a connection to a dattabase. I, then, queried the tables and saved the results to a dataframe. Lastly, I performed analyses on the results.

Connecting to Data

Here, I loaded the readr package to read the csv files on GitHub and saved them to dataframes.

library(readr)

friendsUrl <- 'https://raw.githubusercontent.com/dcorrea614/MSDS/master/assignment2%20-%20friends.csv'
ratingURL <- 'https://raw.githubusercontent.com/dcorrea614/MSDS/master/assignment2%20-%20rating.csv'

dfFriends <- read_csv(file = friendsUrl)

## Parsed with column specification:
## cols(
##   FriendsID = col_double(),
##   FirstName = col_character(),
##   LastName = col_character()
## )

dfRating <- read_csv(file = ratingURL)

## Parsed with column specification:
## cols(
##   RatingID = col_double(),
##   FriendsID = col_double(),
##   Movie = col_character(),
##   Rating = col_double()
## )

dfFriends

## # A tibble: 5 x 3
##   FriendsID FirstName LastName
##       <dbl> <chr>     <chr>   
## 1         1 John      Doe     
## 2         2 Jane      Doe     
## 3         3 Peter     Pascal  
## 4         4 Paul      Panini  
## 5         5 Mary      Meyer

dfRating

## # A tibble: 30 x 4
##    RatingID FriendsID Movie          Rating
##       <dbl>     <dbl> <chr>           <dbl>
##  1        1         1 Ford v Ferrari      3
##  2        2         1 The Irishman       NA
##  3        3         1 Jojo Rabbit         5
##  4        4         1 Joker               5
##  5        5         1 Marriage Story      4
##  6        6         1 Parasite           NA
##  7        7         2 Ford v Ferrari      3
##  8        8         2 The Irishman       NA
##  9        9         2 Jojo Rabbit         5
## 10       10         2 Joker               5
## # ... with 20 more rows

Connecting to Database

To connect to the cloud based database, I used the RMySQL package.

library(RMySQL)

## Loading required package: DBI

con <- dbConnect(MySQL(),
                 user = 'root',
                 host = '34.122.92.218',
                 dbname = 'DATA607')

summary(con)

## <MySQLConnection:0,0>
##   User:   root 
##   Host:   34.122.92.218 
##   Dbname: DATA607 
##   Connection type: 34.122.92.218 via TCP/IP 
## 
## Results:

Creating Tables

Dumping the dataframes into tables.

dbWriteTable(con, 'friends', dfFriends, overwrite = TRUE)

## [1] TRUE

dbWriteTable(con, 'rating', dfRating, overwrite = TRUE)

## [1] TRUE

Query Table

Query the tables and joined on FriendsID.

res <- dbGetQuery(con, 'select concat(FirstName, " ", LastName) as Friend, Movie, Rating 
                  from rating inner join friends on rating.FriendsID = friends.FriendsID;')
res

##          Friend          Movie Rating
## 1      John Doe Ford v Ferrari      3
## 2      John Doe   The Irishman     NA
## 3      John Doe    Jojo Rabbit      5
## 4      John Doe          Joker      5
## 5      John Doe Marriage Story      4
## 6      John Doe       Parasite     NA
## 7      Jane Doe Ford v Ferrari      3
## 8      Jane Doe   The Irishman     NA
## 9      Jane Doe    Jojo Rabbit      5
## 10     Jane Doe          Joker      5
## 11     Jane Doe Marriage Story      4
## 12     Jane Doe       Parasite      4
## 13 Peter Pascal Ford v Ferrari      3
## 14 Peter Pascal   The Irishman     NA
## 15 Peter Pascal    Jojo Rabbit      5
## 16 Peter Pascal          Joker      5
## 17 Peter Pascal Marriage Story     NA
## 18 Peter Pascal       Parasite      2
## 19  Paul Panini Ford v Ferrari      5
## 20  Paul Panini   The Irishman      3
## 21  Paul Panini    Jojo Rabbit      4
## 22  Paul Panini          Joker      4
## 23  Paul Panini Marriage Story      5
## 24  Paul Panini       Parasite      4
## 25   Mary Meyer Ford v Ferrari     NA
## 26   Mary Meyer   The Irishman     NA
## 27   Mary Meyer    Jojo Rabbit      5
## 28   Mary Meyer          Joker      5
## 29   Mary Meyer Marriage Story     NA
## 30   Mary Meyer       Parasite      3

Analyze Results

Using the psych package, we are able to see summary statics of the Rating, grouped by Movie.

Finally, I used the ggplot package to visually see the ratings by Movie.

library(psych)
describeBy(res$Rating, group = res$Movie)

## 
##  Descriptive statistics by group 
## group: Ford v Ferrari
##    vars n mean sd median trimmed mad min max range skew kurtosis  se
## X1    1 4  3.5  1      3     3.5   0   3   5     2 0.75    -1.69 0.5
## ------------------------------------------------------------ 
## group: Jojo Rabbit
##    vars n mean   sd median trimmed mad min max range  skew kurtosis  se
## X1    1 5  4.8 0.45      5     4.8   0   4   5     1 -1.07    -0.92 0.2
## ------------------------------------------------------------ 
## group: Joker
##    vars n mean   sd median trimmed mad min max range  skew kurtosis  se
## X1    1 5  4.8 0.45      5     4.8   0   4   5     1 -1.07    -0.92 0.2
## ------------------------------------------------------------ 
## group: Marriage Story
##    vars n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 3 4.33 0.58      4    4.33   0   4   5     1 0.38    -2.33 0.33
## ------------------------------------------------------------ 
## group: Parasite
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 4 3.25 0.96    3.5    3.25 0.74   2   4     2 -0.32    -2.08 0.48
## ------------------------------------------------------------ 
## group: The Irishman
##    vars n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 1    3 NA      3       3   0   3   3     0   NA       NA NA

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(res, aes(x = Rating)) + geom_histogram(fill = 'blue', binwidth = 1) + 
  facet_wrap(~Movie)

## Warning: Removed 8 rows containing non-finite values (stat_bin).

Conclusion

Based on the info, Jojo Rabbit and Joker have the highest average ratings, while The Irishman received the worst.

DATA607-Assignment2

Diego Correa