Movie Ratings Database, SQL, and R

This project uses movie ratings data of 6 films scrapped from 5 users of the website Letterboxd

The data was loaded into a SQLite database which allows for flat file storage of the data in the movie_ratings.db file located in this directory.

For more information about how the movie_ratings.db was created and populated with data please refer to the readme.md file in this directory. This file also contains the original source links for all of the data that was collected.

Movie Ratings Dataset

The movie ratings data is structured using 3 tables.

A movie table which contains information about the movie.
A reviewers table which contains reviewer information
A ratings table which contains foreign keys for both the movie id and the reviewer id along with the actual rating the user gave to the movie

Here is the data set in its entirety

Movie Ratings Dataset
	id	movie_name	release_date	director	runtime_minutes	budget_dollars	box_office_dollars	rating	username	user_display_name
1	1	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	4.5	kurstboy	karsten
2	1	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	3.5	bratpitt	BRAT
3	1	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	4.5	jay	Jay
4	1	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	4.0	deathproof	Lucy
5	1	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	NA	sapphicquinn	♦️•Lily•💋
6	2	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	3.5	kurstboy	karsten
7	2	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	2.5	bratpitt	BRAT
8	2	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	4.5	jay	Jay
9	2	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	3.0	deathproof	Lucy
10	2	Soul	2020-10-11	Pete Doctor	100	1.5e+08	119300000	NA	sapphicquinn	♦️•Lily•💋
11	3	Tenet	2020-09-03	Christopher Nolan	150	2.0e+08	363700000	4.5	kurstboy	karsten
12	3	Tenet	2020-09-03	Christopher Nolan	150	2.0e+08	363700000	3.5	bratpitt	BRAT
13	3	Tenet	2020-09-03	Christopher Nolan	150	2.0e+08	363700000	4.0	jay	Jay
14	3	Tenet	2020-09-03	Christopher Nolan	150	2.0e+08	363700000	2.5	deathproof	Lucy
15	3	Tenet	2020-09-03	Christopher Nolan	150	2.0e+08	363700000	3.0	sapphicquinn	♦️•Lily•💋
16	4	Nomadland	2021-02-19	Chloé Zhao	108	5.0e+06	27000000	3.5	kurstboy	karsten
17	4	Nomadland	2021-02-19	Chloé Zhao	108	5.0e+06	27000000	2.5	bratpitt	BRAT
18	4	Nomadland	2021-02-19	Chloé Zhao	108	5.0e+06	27000000	4.5	jay	Jay
19	4	Nomadland	2021-02-19	Chloé Zhao	108	5.0e+06	27000000	2.0	deathproof	Lucy
20	4	Nomadland	2021-02-19	Chloé Zhao	108	5.0e+06	27000000	3.5	sapphicquinn	♦️•Lily•💋
21	5	Promising Young Woman	2020-12-25	Emerald Fennell	108	1.0e+07	16700000	4.5	kurstboy	karsten
22	5	Promising Young Woman	2020-12-25	Emerald Fennell	108	1.0e+07	16700000	3.0	bratpitt	BRAT
23	5	Promising Young Woman	2020-12-25	Emerald Fennell	108	1.0e+07	16700000	3.0	jay	Jay
24	5	Promising Young Woman	2020-12-25	Emerald Fennell	108	1.0e+07	16700000	2.5	deathproof	Lucy
25	5	Promising Young Woman	2020-12-25	Emerald Fennell	108	1.0e+07	16700000	4.0	sapphicquinn	♦️•Lily•💋

Graphing

As you can see the data above contains empty values due to not every reviewer having seeing every film. I represented this by allowing nulls in the ratings table and inserting records for those users with null ratings. This is a bit of a contrived example because in the real world I would have just not inserted those records.

Graphing the data set without removing those null records resulted in a warning. In order to handle this I used the drop_na() function of the tidyr library.

Below is a boxplot with the distribution of ratings for each movie from our users.

SQL and R

Nick Oliver

Movie Ratings Database, SQL, and R

Movie Ratings Dataset

Graphing