Assignment 2: Movies Database

Already done (using MySQL Workbench):

Choose six recent popular movies.

Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movies that they have seen on a scale of 1 to 5.

Take the results (observations) and store them in a SQL database. (See attached SQL script for database creation and loading.)

To be done here:

Load the information into an R dataframe, and examine it.

Load up some libraries
## Loading required package: DBI
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Connect to the MySql database and retrieve the data:

The dimensions of the results dataframe are 30, 3 .
## 'data.frame':    30 obs. of  3 variables:
##  $ Movie_title: chr  "Aladdin (2019)" "Aladdin (2019)" "Aladdin (2019)" "Aladdin (2019)" ...
##  $ Friend_name: chr  "Andrew" "Bernard" "Charlie" "Dilbert" ...
##  $ Rating     : int  1 5 1 2 4 4 1 3 2 1 ...
##  Movie_title        Friend_name            Rating    
##  Length:30          Length:30          Min.   :1.00  
##  Class :character   Class :character   1st Qu.:1.00  
##  Mode  :character   Mode  :character   Median :2.00  
##                                        Mean   :2.52  
##                                        3rd Qu.:4.00  
##                                        Max.   :5.00  
##                                        NA's   :1

List the results (there are only 30 rows):

##                  Movie_title Friend_name Rating
## 1             Aladdin (2019)      Andrew      1
## 2             Aladdin (2019)     Bernard      5
## 3             Aladdin (2019)     Charlie      1
## 4             Aladdin (2019)     Dilbert      2
## 5             Aladdin (2019)     Ernesto      4
## 6          Avengers: Endgame      Andrew      4
## 7          Avengers: Endgame     Bernard      1
## 8          Avengers: Endgame     Charlie      3
## 9          Avengers: Endgame     Dilbert      2
## 10         Avengers: Endgame     Ernesto      1
## 11            Captain Marvel      Andrew      4
## 12            Captain Marvel     Bernard      5
## 13            Captain Marvel     Charlie      5
## 14            Captain Marvel     Dilbert      2
## 15            Captain Marvel     Ernesto      4
## 16 Spider-Man: Far from Home      Andrew      2
## 17 Spider-Man: Far from Home     Bernard      1
## 18 Spider-Man: Far from Home     Charlie      1
## 19 Spider-Man: Far from Home     Dilbert      2
## 20 Spider-Man: Far from Home     Ernesto      2
## 21      The Lion King (2019)      Andrew      2
## 22      The Lion King (2019)     Bernard      1
## 23      The Lion King (2019)     Charlie      2
## 24      The Lion King (2019)     Dilbert      5
## 25      The Lion King (2019)     Ernesto      3
## 26               Toy Story 4      Andrew      3
## 27               Toy Story 4     Bernard      2
## 28               Toy Story 4     Charlie      2
## 29               Toy Story 4     Dilbert      1
## 30               Toy Story 4     Ernesto     NA
Describe the results:
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 29 2.51724 1.4046 2 2.44 1.4826 1 5 4 0.563286 -1.09891 0.260828
(Note that there is one “NA” value, which we will have to exclude later.)

Let’s look at the results, grouped by Movie :

## 
##  Descriptive statistics by group 
## group: Aladdin (2019)
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5  2.6 1.82      2     2.6 1.48   1   5     4 0.27    -2.08 0.81
## ------------------------------------------------------------------------------------------ 
## group: Avengers: Endgame
##    vars n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5  2.2 1.3      2     2.2 1.48   1   4     3 0.26    -1.96 0.58
## ------------------------------------------------------------------------------------------ 
## group: Captain Marvel
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 5    4 1.22      4       4 1.48   2   5     3 -0.65     -1.4 0.55
## ------------------------------------------------------------------------------------------ 
## group: Spider-Man: Far from Home
##    vars n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 5  1.6 0.55      2     1.6   0   1   2     1 -0.29    -2.25 0.24
## ------------------------------------------------------------------------------------------ 
## group: The Lion King (2019)
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 5  2.6 1.52      2     2.6 1.48   1   5     4 0.54    -1.49 0.68
## ------------------------------------------------------------------------------------------ 
## group: Toy Story 4
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 4    2 0.82      2       2 0.74   1   3     2    0    -1.88 0.41
We need to drop the item with the NA rating in order to obtain non-NA summary results.
Subsetting using !is.na(result$Rating) :
## # A tibble: 6 x 8
##   Movie_title               count   min  mean median   max    sd   IQR
##   <chr>                     <int> <int> <dbl>  <dbl> <int> <dbl> <dbl>
## 1 Aladdin (2019)                5     1   2.6      2     5 1.82    3  
## 2 Avengers: Endgame             5     1   2.2      2     4 1.30    2  
## 3 Captain Marvel                5     2   4        4     5 1.22    1  
## 4 Spider-Man: Far from Home     5     1   1.6      2     2 0.548   1  
## 5 The Lion King (2019)          5     1   2.6      2     5 1.52    1  
## 6 Toy Story 4                   4     1   2        2     3 0.816   0.5

Now, let’s make a boxplot by Movie :

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

while Spider-Man: Far from Home was at the opposite end of the spectrum, receiving the lowest ratings:
##                  Movie_title Friend_name Rating
## 16 Spider-Man: Far from Home      Andrew      2
## 17 Spider-Man: Far from Home     Bernard      1
## 18 Spider-Man: Far from Home     Charlie      1
## 19 Spider-Man: Far from Home     Dilbert      2
## 20 Spider-Man: Far from Home     Ernesto      2

Now, Let’s look at how each friend tended to rate the films:

## 
##  Descriptive statistics by group 
## group: Andrew
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 6 2.67 1.21    2.5    2.67 1.48   1   4     3 -0.04    -1.88 0.49
## ------------------------------------------------------------------------------------------ 
## group: Bernard
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6  2.5 1.97    1.5     2.5 0.74   1   5     4 0.45    -1.98 0.81
## ------------------------------------------------------------------------------------------ 
## group: Charlie
##    vars n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 6 2.33 1.51      2    2.33 1.48   1   5     4 0.71    -1.15 0.61
## ------------------------------------------------------------------------------------------ 
## group: Dilbert
##    vars n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 6 2.33 1.37      2    2.33   0   1   5     4 1.07    -0.43 0.56
## ------------------------------------------------------------------------------------------ 
## group: Ernesto
##    vars n mean  sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 5  2.8 1.3      3     2.8 1.48   1   4     3 -0.26    -1.96 0.58
Again, we have to exclude the item with the NA:
## # A tibble: 5 x 8
##   Friend_name count   min  mean median   max    sd   IQR
##   <chr>       <int> <int> <dbl>  <dbl> <int> <dbl> <dbl>
## 1 Andrew          6     1  2.67    2.5     4  1.21  1.75
## 2 Bernard         6     1  2.5     1.5     5  1.97  3.25
## 3 Charlie         6     1  2.33    2       5  1.51  1.5 
## 4 Dilbert         6     1  2.33    2       5  1.37  0   
## 5 Ernesto         5     1  2.8     3       4  1.30  2
Boxplot by friend:
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

We observe that Bernard either likes a film or hates it – with Bernard, there is no middle ground.
Bernard gave the widest disperion among his ratings, using mostly “1”s and “5”s, which explains his large IQR and standard deviation.
His Median is the lowest, as half his ratings were “1”s:
##                  Movie_title Friend_name Rating
## 2             Aladdin (2019)     Bernard      5
## 7          Avengers: Endgame     Bernard      1
## 12            Captain Marvel     Bernard      5
## 17 Spider-Man: Far from Home     Bernard      1
## 22      The Lion King (2019)     Bernard      1
## 27               Toy Story 4     Bernard      2
Because Dilbert gave so many ratings of “2”, his IQR = 0, thus his box is flat, with outliers at “1” and “5” :
##                  Movie_title Friend_name Rating
## 4             Aladdin (2019)     Dilbert      2
## 9          Avengers: Endgame     Dilbert      2
## 14            Captain Marvel     Dilbert      2
## 19 Spider-Man: Far from Home     Dilbert      2
## 24      The Lion King (2019)     Dilbert      5
## 29               Toy Story 4     Dilbert      1

Conclusion:

With a small data set (6 movies and 5 reviewers) the aggregated figures display interesting results across both movie and reviewer.

It would be interesting to see the results across a larger sample, for example using the data assembled by “Rotton Tomatoes” which tabulates published movie reviews and scores films on a scale of 0%-100% based upon the percentage of reviews which are favorable vs. unfavorable.

Furthermore, it would be interesting to compare/contrast such “professional” assessments with opinions from individuals, such as those assembled by firms like Amazon.