SQL and R

Nick Oliver

Movie Ratings Database, SQL, and R

This project uses movie ratings data of 6 films scrapped from 5 users of the website Letterboxd

The data was loaded into a SQLite database which allows for flat file storage of the data in the movie_ratings.db file located in this directory.

For more information about how the movie_ratings.db was created and populated with data please refer to the readme.md file in this directory. This file also contains the original source links for all of the data that was collected.

Movie Ratings Dataset

The movie ratings data is structured using 3 tables.

  1. A movie table which contains information about the movie.
  2. A reviewers table which contains reviewer information
  3. A ratings table which contains foreign keys for both the movie id and the reviewer id along with the actual rating the user gave to the movie

Here is the data set in its entirety

Movie Ratings Dataset
id movie_name release_date director runtime_minutes budget_dollars box_office_dollars rating username user_display_name
1 1 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 4.5 kurstboy karsten
2 1 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 3.5 bratpitt BRAT
3 1 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 4.5 jay Jay
4 1 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 4.0 deathproof Lucy
5 1 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 NA sapphicquinn ♦️•Lily•💋
6 2 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 3.5 kurstboy karsten
7 2 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 2.5 bratpitt BRAT
8 2 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 4.5 jay Jay
9 2 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 3.0 deathproof Lucy
10 2 Soul 2020-10-11 Pete Doctor 100 1.5e+08 119300000 NA sapphicquinn ♦️•Lily•💋
11 3 Tenet 2020-09-03 Christopher Nolan 150 2.0e+08 363700000 4.5 kurstboy karsten
12 3 Tenet 2020-09-03 Christopher Nolan 150 2.0e+08 363700000 3.5 bratpitt BRAT
13 3 Tenet 2020-09-03 Christopher Nolan 150 2.0e+08 363700000 4.0 jay Jay
14 3 Tenet 2020-09-03 Christopher Nolan 150 2.0e+08 363700000 2.5 deathproof Lucy
15 3 Tenet 2020-09-03 Christopher Nolan 150 2.0e+08 363700000 3.0 sapphicquinn ♦️•Lily•💋
16 4 Nomadland 2021-02-19 Chloé Zhao 108 5.0e+06 27000000 3.5 kurstboy karsten
17 4 Nomadland 2021-02-19 Chloé Zhao 108 5.0e+06 27000000 2.5 bratpitt BRAT
18 4 Nomadland 2021-02-19 Chloé Zhao 108 5.0e+06 27000000 4.5 jay Jay
19 4 Nomadland 2021-02-19 Chloé Zhao 108 5.0e+06 27000000 2.0 deathproof Lucy
20 4 Nomadland 2021-02-19 Chloé Zhao 108 5.0e+06 27000000 3.5 sapphicquinn ♦️•Lily•💋
21 5 Promising Young Woman 2020-12-25 Emerald Fennell 108 1.0e+07 16700000 4.5 kurstboy karsten
22 5 Promising Young Woman 2020-12-25 Emerald Fennell 108 1.0e+07 16700000 3.0 bratpitt BRAT
23 5 Promising Young Woman 2020-12-25 Emerald Fennell 108 1.0e+07 16700000 3.0 jay Jay
24 5 Promising Young Woman 2020-12-25 Emerald Fennell 108 1.0e+07 16700000 2.5 deathproof Lucy
25 5 Promising Young Woman 2020-12-25 Emerald Fennell 108 1.0e+07 16700000 4.0 sapphicquinn ♦️•Lily•💋

Graphing

As you can see the data above contains empty values due to not every reviewer having seeing every film. I represented this by allowing nulls in the ratings table and inserting records for those users with null ratings. This is a bit of a contrived example because in the real world I would have just not inserted those records.

Graphing the data set without removing those null records resulted in a warning. In order to handle this I used the drop_na() function of the tidyr library.

Below is a boxplot with the distribution of ratings for each movie from our users.