Movie Ratings Database, SQL, and R
This project uses movie ratings data of 6 films scrapped from 5 users of the website Letterboxd
The data was loaded into a SQLite database which allows for flat file storage of the data in the movie_ratings.db file located in this directory.
For more information about how the movie_ratings.db was created and populated with data please refer to the readme.md file in this directory. This file also contains the original source links for all of the data that was collected.
Movie Ratings Dataset
The movie ratings data is structured using 3 tables.
- A movie table which contains information about the movie.
- A reviewers table which contains reviewer information
- A ratings table which contains foreign keys for both the movie id and the reviewer id along with the actual rating the user gave to the movie
Here is the data set in its entirety
| id | movie_name | release_date | director | runtime_minutes | budget_dollars | box_office_dollars | rating | username | user_display_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 4.5 | kurstboy | karsten |
| 2 | 1 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 3.5 | bratpitt | BRAT |
| 3 | 1 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 4.5 | jay | Jay |
| 4 | 1 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 4.0 | deathproof | Lucy |
| 5 | 1 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | NA | sapphicquinn | ♦️•Lily•💋 |
| 6 | 2 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 3.5 | kurstboy | karsten |
| 7 | 2 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 2.5 | bratpitt | BRAT |
| 8 | 2 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 4.5 | jay | Jay |
| 9 | 2 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | 3.0 | deathproof | Lucy |
| 10 | 2 | Soul | 2020-10-11 | Pete Doctor | 100 | 1.5e+08 | 119300000 | NA | sapphicquinn | ♦️•Lily•💋 |
| 11 | 3 | Tenet | 2020-09-03 | Christopher Nolan | 150 | 2.0e+08 | 363700000 | 4.5 | kurstboy | karsten |
| 12 | 3 | Tenet | 2020-09-03 | Christopher Nolan | 150 | 2.0e+08 | 363700000 | 3.5 | bratpitt | BRAT |
| 13 | 3 | Tenet | 2020-09-03 | Christopher Nolan | 150 | 2.0e+08 | 363700000 | 4.0 | jay | Jay |
| 14 | 3 | Tenet | 2020-09-03 | Christopher Nolan | 150 | 2.0e+08 | 363700000 | 2.5 | deathproof | Lucy |
| 15 | 3 | Tenet | 2020-09-03 | Christopher Nolan | 150 | 2.0e+08 | 363700000 | 3.0 | sapphicquinn | ♦️•Lily•💋 |
| 16 | 4 | Nomadland | 2021-02-19 | Chloé Zhao | 108 | 5.0e+06 | 27000000 | 3.5 | kurstboy | karsten |
| 17 | 4 | Nomadland | 2021-02-19 | Chloé Zhao | 108 | 5.0e+06 | 27000000 | 2.5 | bratpitt | BRAT |
| 18 | 4 | Nomadland | 2021-02-19 | Chloé Zhao | 108 | 5.0e+06 | 27000000 | 4.5 | jay | Jay |
| 19 | 4 | Nomadland | 2021-02-19 | Chloé Zhao | 108 | 5.0e+06 | 27000000 | 2.0 | deathproof | Lucy |
| 20 | 4 | Nomadland | 2021-02-19 | Chloé Zhao | 108 | 5.0e+06 | 27000000 | 3.5 | sapphicquinn | ♦️•Lily•💋 |
| 21 | 5 | Promising Young Woman | 2020-12-25 | Emerald Fennell | 108 | 1.0e+07 | 16700000 | 4.5 | kurstboy | karsten |
| 22 | 5 | Promising Young Woman | 2020-12-25 | Emerald Fennell | 108 | 1.0e+07 | 16700000 | 3.0 | bratpitt | BRAT |
| 23 | 5 | Promising Young Woman | 2020-12-25 | Emerald Fennell | 108 | 1.0e+07 | 16700000 | 3.0 | jay | Jay |
| 24 | 5 | Promising Young Woman | 2020-12-25 | Emerald Fennell | 108 | 1.0e+07 | 16700000 | 2.5 | deathproof | Lucy |
| 25 | 5 | Promising Young Woman | 2020-12-25 | Emerald Fennell | 108 | 1.0e+07 | 16700000 | 4.0 | sapphicquinn | ♦️•Lily•💋 |
Graphing
As you can see the data above contains empty values due to not every reviewer having seeing every film. I represented this by allowing nulls in the ratings table and inserting records for those users with null ratings. This is a bit of a contrived example because in the real world I would have just not inserted those records.
Graphing the data set without removing those null records resulted in a warning. In order to handle this I used the drop_na() function of the tidyr library.
Below is a boxplot with the distribution of ratings for each movie from our users.