DATA607_Week 2 Assignment

Overview

Movie rating data for six recent films was collected from five family members (raters). Each family member was asked to rate each of movies they have seen on a scale of one to five. If a rater had not seen the film in question, a NULL value was put in place of a rating.

Part 1 & 2: Build Table and Store data in SQL database

The following SQL queries were used to create normalized movie rating data tables within the MySQL Workbench. The data is stored in the class (cunydata607) MySQL database in the ‘olivia.azevedo90’ schema.

-- rater table
CREATE TABLE rater (
    person_id VARCHAR(5) UNIQUE,
    person_name VARCHAR(100) UNIQUE);
INSERT INTO rater (person_id, person_name) VALUES
  ('1', 'Neil'),
  ('2', 'Myles'),
  ('3', 'Anna'),
  ('4', 'Brady'),
  ('5', 'Owen');

-- movies table
CREATE TABLE movies (
    movie_id VARCHAR(5) UNIQUE,
    movie_title VARCHAR(100) UNIQUE
);
INSERT INTO movies (movie_id, movie_title) VALUES 
    ('M1', 'Barbie'),
    ('M2', 'Conclave'),
    ('M3', 'Oppenheimer'),
    ('M4', 'Saturday Night'),
    ('M5', 'Smile 2'),
    ('M6', 'Wicked');

-- rates table
CREATE TABLE rates (
    person_id VARCHAR(5),
    movie_id VARCHAR(5),
    rating INT
);
INSERT INTO rates (person_id, movie_id, rating) VALUES
    -- Person 1
    ('1', 'M3', 5),
    ('1', 'M1', 1),
    ('1', 'M5', NULL),
    ('1', 'M6', 5),
    ('1', 'M2', 4),
    ('1', 'M4', NULL),
    -- Person 2
    ('2', 'M3', NULL),
    ('2', 'M1', 5),
    ('2', 'M5', 5),
    ('2', 'M6', NULL),
    ('2', 'M2', NULL),
    ('2', 'M4', 4),
    -- Person 3
    ('3', 'M3', NULL),
    ('3', 'M1', 3),
    ('3', 'M5', 1),
    ('3', 'M6', 5),
    ('3', 'M2', NULL),
    ('3', 'M4', 5),
    -- Person 4
    ('4', 'M3', 5),
    ('4', 'M1', 2),
    ('4', 'M5', 3),
    ('4', 'M6', NULL),
    ('4', 'M2', NULL),
    ('4', 'M4', NULL),
    -- Person 5
    ('5', 'M3', 4),
    ('5', 'M1', 3),
    ('5', 'M5', 1),
    ('5', 'M6', 3),
    ('5', 'M2', 5),
    ('5', 'M4', 4);

Part 3: Transfer data from SQL database to R dataframe

Database Connection

Define connection string to MySQL database
Note: To avoid hard coding database credentials, the database password was defined using the keyring package and is stored in the keyring store of the operating system. The stored credential can be accessed via the key_get() function as seen in the connection string below.

con = dbConnect(MySQL(),
          user='olivia.azevedo90',
          password=key_get("MySQL_PASSWORD"),
          host='cuny607sql.mysql.database.azure.com',
          port=3306,
          dbname='olivia.azevedo90')

Import Data

Load normalized data from the SQL database into an R data frame

query = '
  SELECT 
    rates.person_id,
      person_name,
      rates.movie_id,
      movie_title,
      rating
  FROM rates
  LEFT JOIN rater
    on rates.person_id = rater.person_id 
  LEFT JOIN movies
    on rates.movie_id = movies.movie_id'
df = dbGetQuery(con, query)
head(df)

##   person_id person_name movie_id    movie_title rating
## 1         1        Neil       M3    Oppenheimer      5
## 2         1        Neil       M1         Barbie      1
## 3         1        Neil       M5        Smile 2     NA
## 4         1        Neil       M6         Wicked      5
## 5         1        Neil       M2       Conclave      4
## 6         1        Neil       M4 Saturday Night     NA

Part 4: Missing data strategy

Assess Missing Data

Identify and assess extent of missing values

## Total Missing Values:  10

Column NA Counts and Data Type
column	na_count	dtype
person_id	0	character
person_name	0	character
movie_id	0	character
movie_title	0	character
rating	10	integer

Missing Data Approach

My approach to addressing the movie rating missing values involves filling the NA values with zero. I choose this approach due to the missing values being explicit and representing that the person providing the rating did not see the movie.

Removing the missing values would not be the most optimal approach due to:

A significant amount of data (30% or 10/30 rows) would be removed from the data
- There are no rows in which all the data is missing
The missing values do not represent uncollected data or data quality issues
- The missing values provide additional insight by indicating the rater has not seen the movie

In addition, filling the values with zero converts the rating column data type from integer to numeric. Due to wanting to maintain a numerical data type, the missing values were not filled with a string.
Lastly, the missing values were not replaced with a number other than zero (manually or via back/forward filling) since this would cause the data to be inaccurate and not reflect the true movie ratings provided by the raters.

df_nona = df |>
  mutate(rating = coalesce(rating, 0))

Column NA Counts and Data Type
column	na_count	dtype
person_id	0	character
person_name	0	character
movie_id	0	character
movie_title	0	character
rating	0	numeric

Bonus Challenge Questions

Are you able to use a password without having to share the password with people who are viewing your code?
Yes - please refer to the note in the “Database Connection” section

Can you create a normalized set of tables that corresponds to the relationship between your movie viewing friends and the movies being rated?
I created a normalized set of tables by splitting my original table into three tables: raters, movies, and rates.

raters
- Columns: person_id, person_name
movies
- Columns: movie_id, movie_title
- Updates: created movie id column to make the movie title dependent on a primary key
rates
- Columns: person_id, movie_id, rating
- Updates: assigned the ratings to a person id and movie id to remove transitive partial dependency

Normalized Tables

Table: rater
person_id	person_name
1	Neil
2	Myles
3	Anna
4	Brady
5	Owen

Table: movies
movie_id	movie_title
M1	Barbie
M2	Conclave
M3	Oppenheimer
M4	Saturday Night
M5	Smile 2
M6	Wicked

Table: rates
person_id	movie_id	rating
1	M3	5
1	M1	1
1	M5	NA
1	M6	5
1	M2	4
1	M4	NA
2	M3	NA
2	M1	5
2	M5	5
2	M6	NA
2	M2	NA
2	M4	4
3	M3	NA
3	M1	3
3	M5	1
3	M6	5
3	M2	NA
3	M4	5
4	M3	5
4	M1	2
4	M5	3
4	M6	NA
4	M2	NA
4	M4	NA
5	M3	4
5	M1	3
5	M5	1
5	M6	3
5	M2	5
5	M4	4

Non-Normalized Table

Table: movie_rates
person_id	person_name	movie_title	rating
1	Neil	Oppenheimer	5
1	Neil	Barbie	1
1	Neil	Smile 2	NA
1	Neil	Wicked	5
1	Neil	Conclave	4
1	Neil	Saturday Night	NA
2	Myles	Oppenheimer	NA
2	Myles	Barbie	5
2	Myles	Smile 2	5
2	Myles	Wicked	NA
2	Myles	Conclave	NA
2	Myles	Saturday Night	4
3	Anna	Oppenheimer	NA
3	Anna	Barbie	3
3	Anna	Smile 2	1
3	Anna	Wicked	5
3	Anna	Conclave	NA
3	Anna	Saturday Night	5
4	Brady	Oppenheimer	5
4	Brady	Barbie	2
4	Brady	Smile 2	3
4	Brady	Wicked	NA
4	Brady	Conclave	NA
4	Brady	Saturday Night	NA
5	Owen	Oppenheimer	4
5	Owen	Barbie	3
5	Owen	Smile 2	1
5	Owen	Wicked	3
5	Owen	Conclave	5
5	Owen	Saturday Night	4

Exploratory Data Analysis

Movie Rating Distribution Plots

Overall Rating Distribution

df_nona |>
  group_by(rating) |>
  tally() |>
  ggplot(aes(x = rating, y = n)) +
    geom_col() +
    labs(
      title = "Movie Rating Distribution",
      x = 'Rating',
      y = 'Count') + 
    theme_classic()

Rating Distribution by Rater

df_nona |>
  group_by(rating,  person_name) |>
  tally() |>
  ggplot(aes(x = rating, y = n, fill=person_name)) +
    geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
    labs(
      title = "Movie Rating Distribution by Rater",
      x = 'Rating',
      y = 'Count') + 
    theme_classic()

Movie Rating Plots

All Movie Ratings by Rater

df_nona |>
  filter(rating >=1) |>
  ggplot(
    aes(x = movie_title, y = rating, fill=person_name)) +
    geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
    labs(
      title = "Movie Rating Distribution by Rater",
      x = 'Movie',
      y = 'Rating') + 
    theme_classic()

Average Movie Rating

df_nona |>
  group_by(movie_title) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(movie_title, -avg_ratings), y = avg_ratings)) +
    geom_col(fill='#2d3c52') +
    labs(
      title = "Average Movie Rating",
      x = 'Movie',
      y = 'Average Rating') + 
    theme_classic()

Average Movie Rating by Rater

df_nona |>
  group_by(person_name) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(person_name, -avg_ratings), y = avg_ratings)) +
    geom_col(fill='#557874') +
    labs(
      title = "Rater Average Movie Rating",
      x = 'Rater',
      y = 'Average Rating') + 
    theme_classic()

Movies Watched Plots

Movies Watched Distribution

df_nona |>
  filter(rating >=1) |>
  group_by(movie_title) |>
  tally() |>
  ggplot(
    aes(x = reorder(movie_title, -n), y = n)) +
    geom_col(fill='#2d3c52') +
    labs(
      title = "Movies Watched Distribution",
      x = 'Movie',
      y = 'Watched Count') + 
    theme_classic()

Movies Watched by Rater Distribution

df_rater_n = df_nona |>
  filter(rating >=1) |>
  group_by(person_name) |>
  tally()

  ggplot(df_rater_n,
    aes(x = reorder(person_name, -n), y = n)) +
    geom_col(fill='#557874') +
    labs(
      title = "Movies Watched by Rater Distribution",
      x = 'Movie',
      y = 'Watched Count') + 
    theme_classic()

Movie Rating & Movies Watched Plot

Average Rating per the Number of Movies Watched by the Rater

merge(df_nona, df_rater_n) |>
  group_by(n) |>
  summarise(avg_ratings = mean(rating)) |>
  ggplot(
    aes(x = reorder(n, -avg_ratings), y = avg_ratings)) +
  geom_col(fill='#2d3c52') +
  labs(
      title = "Average Rating per Number of Movies Watched",
      x = "Number of Movie's Watched",
      y = 'Average Rating') + 
    theme_classic()

Conclusions

Due to this data set being so small and having many missing rating values, there is not enough evidence to draw any meaningful conclusions. Some potential insights include:

The number of movie’s watched being positively correlated to movie ratings
The movie Barbie being the most highly rated movie of the six movies included in the survey
- Note: This is most likely influenced by the movie Barbie being the only movie that all raters watched and provided a rating for

To improve this analysis, I would recommend collecting more movie ratings via surveying a much larger population. A much larger data set will provide a better representation of overall movie ratings, as well as provide opportunity to apply a different approach to addressing missing rating values.