Overview

Movie rating data for six recent films was collected from five family members (raters). Each family member was asked to rate each of movies they have seen on a scale of one to five. If a rater had not seen the film in question, a NULL value was put in place of a rating.


Part 1 & 2: Build Table and Store data in SQL database

The following SQL queries were used to create normalized movie rating data tables within the MySQL Workbench. The data is stored in the class (cunydata607) MySQL database in the ‘olivia.azevedo90’ schema.

-- rater table
CREATE TABLE rater (
    person_id VARCHAR(5) UNIQUE,
    person_name VARCHAR(100) UNIQUE);
INSERT INTO rater (person_id, person_name) VALUES
  ('1', 'Neil'),
  ('2', 'Myles'),
  ('3', 'Anna'),
  ('4', 'Brady'),
  ('5', 'Owen');

-- movies table
CREATE TABLE movies (
    movie_id VARCHAR(5) UNIQUE,
    movie_title VARCHAR(100) UNIQUE
);
INSERT INTO movies (movie_id, movie_title) VALUES 
    ('M1', 'Barbie'),
    ('M2', 'Conclave'),
    ('M3', 'Oppenheimer'),
    ('M4', 'Saturday Night'),
    ('M5', 'Smile 2'),
    ('M6', 'Wicked');

-- rates table
CREATE TABLE rates (
    person_id VARCHAR(5),
    movie_id VARCHAR(5),
    rating INT
);
INSERT INTO rates (person_id, movie_id, rating) VALUES
    -- Person 1
    ('1', 'M3', 5),
    ('1', 'M1', 1),
    ('1', 'M5', NULL),
    ('1', 'M6', 5),
    ('1', 'M2', 4),
    ('1', 'M4', NULL),
    -- Person 2
    ('2', 'M3', NULL),
    ('2', 'M1', 5),
    ('2', 'M5', 5),
    ('2', 'M6', NULL),
    ('2', 'M2', NULL),
    ('2', 'M4', 4),
    -- Person 3
    ('3', 'M3', NULL),
    ('3', 'M1', 3),
    ('3', 'M5', 1),
    ('3', 'M6', 5),
    ('3', 'M2', NULL),
    ('3', 'M4', 5),
    -- Person 4
    ('4', 'M3', 5),
    ('4', 'M1', 2),
    ('4', 'M5', 3),
    ('4', 'M6', NULL),
    ('4', 'M2', NULL),
    ('4', 'M4', NULL),
    -- Person 5
    ('5', 'M3', 4),
    ('5', 'M1', 3),
    ('5', 'M5', 1),
    ('5', 'M6', 3),
    ('5', 'M2', 5),
    ('5', 'M4', 4);

Part 3: Transfer data from SQL database to R dataframe

Database Connection

Define connection string to MySQL database
Note: To avoid hard coding database credentials, the database password was defined using the keyring package and is stored in the keyring store of the operating system. The stored credential can be accessed via the key_get() function as seen in the connection string below.

con = dbConnect(MySQL(),
          user='olivia.azevedo90',
          password=key_get("MySQL_PASSWORD"),
          host='cuny607sql.mysql.database.azure.com',
          port=3306,
          dbname='olivia.azevedo90')

Import Data

Load normalized data from the SQL database into an R data frame

query = '
  SELECT 
    rates.person_id,
      person_name,
      rates.movie_id,
      movie_title,
      rating
  FROM rates
  LEFT JOIN rater
    on rates.person_id = rater.person_id 
  LEFT JOIN movies
    on rates.movie_id = movies.movie_id'
df = dbGetQuery(con, query)
head(df)
##   person_id person_name movie_id    movie_title rating
## 1         1        Neil       M3    Oppenheimer      5
## 2         1        Neil       M1         Barbie      1
## 3         1        Neil       M5        Smile 2     NA
## 4         1        Neil       M6         Wicked      5
## 5         1        Neil       M2       Conclave      4
## 6         1        Neil       M4 Saturday Night     NA

Part 4: Missing data strategy

Assess Missing Data

Identify and assess extent of missing values

## Total Missing Values:  10
Column NA Counts and Data Type
column na_count dtype
person_id 0 character
person_name 0 character
movie_id 0 character
movie_title 0 character
rating 10 integer

Missing Data Approach

My approach to addressing the movie rating missing values involves filling the NA values with zero. I choose this approach due to the missing values being explicit and representing that the person providing the rating did not see the movie.

Removing the missing values would not be the most optimal approach due to:

  1. A significant amount of data (30% or 10/30 rows) would be removed from the data
    • There are no rows in which all the data is missing
  2. The missing values do not represent uncollected data or data quality issues
    • The missing values provide additional insight by indicating the rater has not seen the movie

In addition, filling the values with zero converts the rating column data type from integer to numeric. Due to wanting to maintain a numerical data type, the missing values were not filled with a string.
Lastly, the missing values were not replaced with a number other than zero (manually or via back/forward filling) since this would cause the data to be inaccurate and not reflect the true movie ratings provided by the raters.

df_nona = df |>
  mutate(rating = coalesce(rating, 0))
Column NA Counts and Data Type
column na_count dtype
person_id 0 character
person_name 0 character
movie_id 0 character
movie_title 0 character
rating 0 numeric

Bonus Challenge Questions

Are you able to use a password without having to share the password with people who are viewing your code?
Yes - please refer to the note in the “Database Connection” section

Can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
I created a normalized set of tables by splitting my original table into three tables: raters, movies, and rates.

  • raters
    • Columns: person_id, person_name
  • movies
    • Columns: movie_id, movie_title
    • Updates: created movie id column to make the movie title dependent on a primary key
  • rates
    • Columns: person_id, movie_id, rating
    • Updates: assigned the ratings to a person id and movie id to remove transitive partial dependency

Normalized Tables

Table: rater
person_id person_name
1 Neil
2 Myles
3 Anna
4 Brady
5 Owen
Table: movies
movie_id movie_title
M1 Barbie
M2 Conclave
M3 Oppenheimer
M4 Saturday Night
M5 Smile 2
M6 Wicked
Table: rates
person_id movie_id rating
1 M3 5
1 M1 1
1 M5 NA
1 M6 5
1 M2 4
1 M4 NA
2 M3 NA
2 M1 5
2 M5 5
2 M6 NA
2 M2 NA
2 M4 4
3 M3 NA
3 M1 3
3 M5 1
3 M6 5
3 M2 NA
3 M4 5
4 M3 5
4 M1 2
4 M5 3
4 M6 NA
4 M2 NA
4 M4 NA
5 M3 4
5 M1 3
5 M5 1
5 M6 3
5 M2 5
5 M4 4

Non-Normalized Table

Table: movie_rates
person_id person_name movie_title rating
1 Neil Oppenheimer 5
1 Neil Barbie 1
1 Neil Smile 2 NA
1 Neil Wicked 5
1 Neil Conclave 4
1 Neil Saturday Night NA
2 Myles Oppenheimer NA
2 Myles Barbie 5
2 Myles Smile 2 5
2 Myles Wicked NA
2 Myles Conclave NA
2 Myles Saturday Night 4
3 Anna Oppenheimer NA
3 Anna Barbie 3
3 Anna Smile 2 1
3 Anna Wicked 5
3 Anna Conclave NA
3 Anna Saturday Night 5
4 Brady Oppenheimer 5
4 Brady Barbie 2
4 Brady Smile 2 3
4 Brady Wicked NA
4 Brady Conclave NA
4 Brady Saturday Night NA
5 Owen Oppenheimer 4
5 Owen Barbie 3
5 Owen Smile 2 1
5 Owen Wicked 3
5 Owen Conclave 5
5 Owen Saturday Night 4

Exploratory Data Analysis

Movie Rating Distribution Plots

Overall Rating Distribution

df_nona |>
  group_by(rating) |>
  tally() |>
  ggplot(aes(x = rating, y = n)) +
    geom_col() +
    labs(
      title = "Movie Rating Distribution",
      x = 'Rating',
      y = 'Count') + 
    theme_classic()

Rating Distribution by Rater

df_nona |>
  group_by(rating,  person_name) |>
  tally() |>
  ggplot(aes(x = rating, y = n, fill=person_name)) +
    geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
    labs(
      title = "Movie Rating Distribution by Rater",
      x = 'Rating',
      y = 'Count') + 
    theme_classic()

Movie Rating Plots

All Movie Ratings by Rater

df_nona |>
  filter(rating >=1) |>
  ggplot(
    aes(x = movie_title, y = rating, fill=person_name)) +
    geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
    labs(
      title = "Movie Rating Distribution by Rater",
      x = 'Movie',
      y = 'Rating') + 
    theme_classic()

Average Movie Rating

df_nona |>
  group_by(movie_title) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(movie_title, -avg_ratings), y = avg_ratings)) +
    geom_col(fill='#2d3c52') +
    labs(
      title = "Average Movie Rating",
      x = 'Movie',
      y = 'Average Rating') + 
    theme_classic()

Average Movie Rating by Rater

df_nona |>
  group_by(person_name) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(person_name, -avg_ratings), y = avg_ratings)) +
    geom_col(fill='#557874') +
    labs(
      title = "Rater Average Movie Rating",
      x = 'Rater',
      y = 'Average Rating') + 
    theme_classic()

Movies Watched Plots

Movies Watched Distribution

df_nona |>
  filter(rating >=1) |>
  group_by(movie_title) |>
  tally() |>
  ggplot(
    aes(x = reorder(movie_title, -n), y = n)) +
    geom_col(fill='#2d3c52') +
    labs(
      title = "Movies Watched Distribution",
      x = 'Movie',
      y = 'Watched Count') + 
    theme_classic()

Movies Watched by Rater Distribution

df_rater_n = df_nona |>
  filter(rating >=1) |>
  group_by(person_name) |>
  tally()

  ggplot(df_rater_n,
    aes(x = reorder(person_name, -n), y = n)) +
    geom_col(fill='#557874') +
    labs(
      title = "Movies Watched by Rater Distribution",
      x = 'Movie',
      y = 'Watched Count') + 
    theme_classic()

Movie Rating & Movies Watched Plot

Average Rating per the Number of Movies Watched by the Rater

merge(df_nona, df_rater_n) |>
  group_by(n) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(n, -avg_ratings), y = avg_ratings)) +
  geom_col(fill='#2d3c52') +
  labs(
      title = "Average Rating per Number of Movies Watched",
      x = "Number of Movie's Watched",
      y = 'Average Rating') + 
    theme_classic()


Conclusions

Due to this data set being so small and having many missing rating values, there is not enough evidence to draw any meaningful conclusions. Some potential insights include:

  • The number of movie’s watched being positively correlated to movie ratings
  • The movie Barbie being the most highly rated movie of the six movies included in the survey
    • Note: This is most likely influenced by the movie Barbie being the only movie that all raters watched and provided a rating for

To improve this analysis, I would recommend collecting more movie ratings via surveying a much larger population. A much larger data set will provide a better representation of overall movie ratings, as well as provide opportunity to apply a different approach to addressing missing rating values.