Movie rating data for six recent films was collected from five family members (raters). Each family member was asked to rate each of movies they have seen on a scale of one to five. If a rater had not seen the film in question, a NULL value was put in place of a rating.
The following SQL queries were used to create normalized movie rating data tables within the MySQL Workbench. The data is stored in the class (cunydata607) MySQL database in the ‘olivia.azevedo90’ schema.
-- rater table
CREATE TABLE rater (
person_id VARCHAR(5) UNIQUE,
person_name VARCHAR(100) UNIQUE);
INSERT INTO rater (person_id, person_name) VALUES
('1', 'Neil'),
('2', 'Myles'),
('3', 'Anna'),
('4', 'Brady'),
('5', 'Owen');
-- movies table
CREATE TABLE movies (
movie_id VARCHAR(5) UNIQUE,
movie_title VARCHAR(100) UNIQUE
);
INSERT INTO movies (movie_id, movie_title) VALUES
('M1', 'Barbie'),
('M2', 'Conclave'),
('M3', 'Oppenheimer'),
('M4', 'Saturday Night'),
('M5', 'Smile 2'),
('M6', 'Wicked');
-- rates table
CREATE TABLE rates (
person_id VARCHAR(5),
movie_id VARCHAR(5),
rating INT
);
INSERT INTO rates (person_id, movie_id, rating) VALUES
-- Person 1
('1', 'M3', 5),
('1', 'M1', 1),
('1', 'M5', NULL),
('1', 'M6', 5),
('1', 'M2', 4),
('1', 'M4', NULL),
-- Person 2
('2', 'M3', NULL),
('2', 'M1', 5),
('2', 'M5', 5),
('2', 'M6', NULL),
('2', 'M2', NULL),
('2', 'M4', 4),
-- Person 3
('3', 'M3', NULL),
('3', 'M1', 3),
('3', 'M5', 1),
('3', 'M6', 5),
('3', 'M2', NULL),
('3', 'M4', 5),
-- Person 4
('4', 'M3', 5),
('4', 'M1', 2),
('4', 'M5', 3),
('4', 'M6', NULL),
('4', 'M2', NULL),
('4', 'M4', NULL),
-- Person 5
('5', 'M3', 4),
('5', 'M1', 3),
('5', 'M5', 1),
('5', 'M6', 3),
('5', 'M2', 5),
('5', 'M4', 4);
Define connection string to MySQL database
Note: To avoid hard coding database credentials, the
database password was defined using the keyring package and
is stored in the keyring store of the operating system. The stored
credential can be accessed via the key_get() function as
seen in the connection string below.
con = dbConnect(MySQL(),
user='olivia.azevedo90',
password=key_get("MySQL_PASSWORD"),
host='cuny607sql.mysql.database.azure.com',
port=3306,
dbname='olivia.azevedo90')
Load normalized data from the SQL database into an R data frame
query = '
SELECT
rates.person_id,
person_name,
rates.movie_id,
movie_title,
rating
FROM rates
LEFT JOIN rater
on rates.person_id = rater.person_id
LEFT JOIN movies
on rates.movie_id = movies.movie_id'
df = dbGetQuery(con, query)
head(df)
## person_id person_name movie_id movie_title rating
## 1 1 Neil M3 Oppenheimer 5
## 2 1 Neil M1 Barbie 1
## 3 1 Neil M5 Smile 2 NA
## 4 1 Neil M6 Wicked 5
## 5 1 Neil M2 Conclave 4
## 6 1 Neil M4 Saturday Night NA
Identify and assess extent of missing values
## Total Missing Values: 10
| column | na_count | dtype |
|---|---|---|
| person_id | 0 | character |
| person_name | 0 | character |
| movie_id | 0 | character |
| movie_title | 0 | character |
| rating | 10 | integer |
My approach to addressing the movie rating missing values involves
filling the NA values with zero. I choose this approach due to the
missing values being explicit and representing that the person providing
the rating did not see the movie.
Removing the missing values would not be the most optimal approach due to:
In addition, filling the values with zero converts the rating column
data type from integer to numeric. Due to wanting to maintain a
numerical data type, the missing values were not filled with a
string.
Lastly, the missing values were not replaced with a number other than
zero (manually or via back/forward filling) since this would cause the
data to be inaccurate and not reflect the true movie ratings provided by
the raters.
df_nona = df |>
mutate(rating = coalesce(rating, 0))
| column | na_count | dtype |
|---|---|---|
| person_id | 0 | character |
| person_name | 0 | character |
| movie_id | 0 | character |
| movie_title | 0 | character |
| rating | 0 | numeric |
Are you able to use a password without
having to share the password with people who are viewing your code?
Yes - please refer to the note in the “Database Connection” section
Can you create a normalized set of
tables that corresponds to the relationship between your movie viewing
friends and the movies being rated?
I created a normalized set of tables by splitting my original table into
three tables: raters, movies, and rates.
| person_id | person_name |
|---|---|
| 1 | Neil |
| 2 | Myles |
| 3 | Anna |
| 4 | Brady |
| 5 | Owen |
| movie_id | movie_title |
|---|---|
| M1 | Barbie |
| M2 | Conclave |
| M3 | Oppenheimer |
| M4 | Saturday Night |
| M5 | Smile 2 |
| M6 | Wicked |
| person_id | movie_id | rating |
|---|---|---|
| 1 | M3 | 5 |
| 1 | M1 | 1 |
| 1 | M5 | NA |
| 1 | M6 | 5 |
| 1 | M2 | 4 |
| 1 | M4 | NA |
| 2 | M3 | NA |
| 2 | M1 | 5 |
| 2 | M5 | 5 |
| 2 | M6 | NA |
| 2 | M2 | NA |
| 2 | M4 | 4 |
| 3 | M3 | NA |
| 3 | M1 | 3 |
| 3 | M5 | 1 |
| 3 | M6 | 5 |
| 3 | M2 | NA |
| 3 | M4 | 5 |
| 4 | M3 | 5 |
| 4 | M1 | 2 |
| 4 | M5 | 3 |
| 4 | M6 | NA |
| 4 | M2 | NA |
| 4 | M4 | NA |
| 5 | M3 | 4 |
| 5 | M1 | 3 |
| 5 | M5 | 1 |
| 5 | M6 | 3 |
| 5 | M2 | 5 |
| 5 | M4 | 4 |
| person_id | person_name | movie_title | rating |
|---|---|---|---|
| 1 | Neil | Oppenheimer | 5 |
| 1 | Neil | Barbie | 1 |
| 1 | Neil | Smile 2 | NA |
| 1 | Neil | Wicked | 5 |
| 1 | Neil | Conclave | 4 |
| 1 | Neil | Saturday Night | NA |
| 2 | Myles | Oppenheimer | NA |
| 2 | Myles | Barbie | 5 |
| 2 | Myles | Smile 2 | 5 |
| 2 | Myles | Wicked | NA |
| 2 | Myles | Conclave | NA |
| 2 | Myles | Saturday Night | 4 |
| 3 | Anna | Oppenheimer | NA |
| 3 | Anna | Barbie | 3 |
| 3 | Anna | Smile 2 | 1 |
| 3 | Anna | Wicked | 5 |
| 3 | Anna | Conclave | NA |
| 3 | Anna | Saturday Night | 5 |
| 4 | Brady | Oppenheimer | 5 |
| 4 | Brady | Barbie | 2 |
| 4 | Brady | Smile 2 | 3 |
| 4 | Brady | Wicked | NA |
| 4 | Brady | Conclave | NA |
| 4 | Brady | Saturday Night | NA |
| 5 | Owen | Oppenheimer | 4 |
| 5 | Owen | Barbie | 3 |
| 5 | Owen | Smile 2 | 1 |
| 5 | Owen | Wicked | 3 |
| 5 | Owen | Conclave | 5 |
| 5 | Owen | Saturday Night | 4 |
Overall Rating Distribution
df_nona |>
group_by(rating) |>
tally() |>
ggplot(aes(x = rating, y = n)) +
geom_col() +
labs(
title = "Movie Rating Distribution",
x = 'Rating',
y = 'Count') +
theme_classic()
Rating Distribution by Rater
df_nona |>
group_by(rating, person_name) |>
tally() |>
ggplot(aes(x = rating, y = n, fill=person_name)) +
geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
labs(
title = "Movie Rating Distribution by Rater",
x = 'Rating',
y = 'Count') +
theme_classic()
All Movie Ratings by Rater
df_nona |>
filter(rating >=1) |>
ggplot(
aes(x = movie_title, y = rating, fill=person_name)) +
geom_col(position = position_dodge2(width = 0.9, preserve = "single")) +
labs(
title = "Movie Rating Distribution by Rater",
x = 'Movie',
y = 'Rating') +
theme_classic()
Average Movie Rating
df_nona |>
group_by(movie_title) |>
summarise(avg_ratings = mean(rating)) |>
ggplot(
aes(x = reorder(movie_title, -avg_ratings), y = avg_ratings)) +
geom_col(fill='#2d3c52') +
labs(
title = "Average Movie Rating",
x = 'Movie',
y = 'Average Rating') +
theme_classic()
Average Movie Rating by Rater
df_nona |>
group_by(person_name) |>
summarise(avg_ratings = mean(rating)) |>
ggplot(
aes(x = reorder(person_name, -avg_ratings), y = avg_ratings)) +
geom_col(fill='#557874') +
labs(
title = "Rater Average Movie Rating",
x = 'Rater',
y = 'Average Rating') +
theme_classic()
Movies Watched Distribution
df_nona |>
filter(rating >=1) |>
group_by(movie_title) |>
tally() |>
ggplot(
aes(x = reorder(movie_title, -n), y = n)) +
geom_col(fill='#2d3c52') +
labs(
title = "Movies Watched Distribution",
x = 'Movie',
y = 'Watched Count') +
theme_classic()
Movies Watched by Rater Distribution
df_rater_n = df_nona |>
filter(rating >=1) |>
group_by(person_name) |>
tally()
ggplot(df_rater_n,
aes(x = reorder(person_name, -n), y = n)) +
geom_col(fill='#557874') +
labs(
title = "Movies Watched by Rater Distribution",
x = 'Movie',
y = 'Watched Count') +
theme_classic()
Average Rating per the Number of Movies Watched by the Rater
merge(df_nona, df_rater_n) |>
group_by(n) |>
summarise(avg_ratings = mean(rating)) |>
ggplot(
aes(x = reorder(n, -avg_ratings), y = avg_ratings)) +
geom_col(fill='#2d3c52') +
labs(
title = "Average Rating per Number of Movies Watched",
x = "Number of Movie's Watched",
y = 'Average Rating') +
theme_classic()
Due to this data set being so small and having many missing rating values, there is not enough evidence to draw any meaningful conclusions. Some potential insights include:
To improve this analysis, I would recommend collecting more movie ratings via surveying a much larger population. A much larger data set will provide a better representation of overall movie ratings, as well as provide opportunity to apply a different approach to addressing missing rating values.