name rating genre year
1 The Shining R Drama 1980
2 The Blue Lagoon R Adventure 1980
3 Star Wars: Episode V - The Empire Strikes Back PG Action 1980
4 Airplane! PG Comedy 1980
5 Caddyshack R Comedy 1980
6 Friday the 13th R Horror 1980
7 The Blues Brothers R Action 1980
8 Raging Bull R Biography 1980
9 Superman II PG Action 1980
10 The Long Riders R Biography 1980
released score votes director
1 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick
2 July 2, 1980 (United States) 5.8 65000 Randal Kleiser
3 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner
4 July 2, 1980 (United States) 7.7 221000 Jim Abrahams
5 July 25, 1980 (United States) 7.3 108000 Harold Ramis
6 May 9, 1980 (United States) 6.4 123000 Sean S. Cunningham
7 June 20, 1980 (United States) 7.9 188000 John Landis
8 December 19, 1980 (United States) 8.2 330000 Martin Scorsese
9 June 19, 1981 (United States) 6.8 101000 Richard Lester
10 May 16, 1980 (United States) 7.0 10000 Walter Hill
writer star country budget gross
1 Stephen King Jack Nicholson United Kingdom 1.9e+07 46998772
2 Henry De Vere Stacpoole Brooke Shields United States 4.5e+06 58853106
3 Leigh Brackett Mark Hamill United States 1.8e+07 538375067
4 Jim Abrahams Robert Hays United States 3.5e+06 83453539
5 Brian Doyle-Murray Chevy Chase United States 6.0e+06 39846344
6 Victor Miller Betsy Palmer United States 5.5e+05 39754601
7 Dan Aykroyd John Belushi United States 2.7e+07 115229890
8 Jake LaMotta Robert De Niro United States 1.8e+07 23402427
9 Jerry Siegel Gene Hackman United States 5.4e+07 108185706
10 Bill Bryden David Carradine United States 1.0e+07 15795189
company runtime
1 Warner Bros. 146
2 Columbia Pictures 104
3 Lucasfilm 124
4 Paramount Pictures 88
5 Orion Pictures 98
6 Paramount Pictures 95
7 Universal Pictures 133
8 Chartoff-Winkler Productions 129
9 Dovemead Films 127
10 United Artists 100
Clean the Data
The dataset is mostly clean; however, a few columns could be represented in a better format. Rating, genre, and country should all be factors, since they are categorical options rather than values on a continuous spectrum.
movies <- mutate(movies_raw, rating = factor(rating), genre = factor(genre), country = factor(country))
head(movies, 10)
name rating genre year
1 The Shining R Drama 1980
2 The Blue Lagoon R Adventure 1980
3 Star Wars: Episode V - The Empire Strikes Back PG Action 1980
4 Airplane! PG Comedy 1980
5 Caddyshack R Comedy 1980
6 Friday the 13th R Horror 1980
7 The Blues Brothers R Action 1980
8 Raging Bull R Biography 1980
9 Superman II PG Action 1980
10 The Long Riders R Biography 1980
released score votes director
1 June 13, 1980 (United States) 8.4 927000 Stanley Kubrick
2 July 2, 1980 (United States) 5.8 65000 Randal Kleiser
3 June 20, 1980 (United States) 8.7 1200000 Irvin Kershner
4 July 2, 1980 (United States) 7.7 221000 Jim Abrahams
5 July 25, 1980 (United States) 7.3 108000 Harold Ramis
6 May 9, 1980 (United States) 6.4 123000 Sean S. Cunningham
7 June 20, 1980 (United States) 7.9 188000 John Landis
8 December 19, 1980 (United States) 8.2 330000 Martin Scorsese
9 June 19, 1981 (United States) 6.8 101000 Richard Lester
10 May 16, 1980 (United States) 7.0 10000 Walter Hill
writer star country budget gross
1 Stephen King Jack Nicholson United Kingdom 1.9e+07 46998772
2 Henry De Vere Stacpoole Brooke Shields United States 4.5e+06 58853106
3 Leigh Brackett Mark Hamill United States 1.8e+07 538375067
4 Jim Abrahams Robert Hays United States 3.5e+06 83453539
5 Brian Doyle-Murray Chevy Chase United States 6.0e+06 39846344
6 Victor Miller Betsy Palmer United States 5.5e+05 39754601
7 Dan Aykroyd John Belushi United States 2.7e+07 115229890
8 Jake LaMotta Robert De Niro United States 1.8e+07 23402427
9 Jerry Siegel Gene Hackman United States 5.4e+07 108185706
10 Bill Bryden David Carradine United States 1.0e+07 15795189
company runtime
1 Warner Bros. 146
2 Columbia Pictures 104
3 Lucasfilm 124
4 Paramount Pictures 88
5 Orion Pictures 98
6 Paramount Pictures 95
7 Universal Pictures 133
8 Chartoff-Winkler Productions 129
9 Dovemead Films 127
10 United Artists 100
Next, the column “released” currently combines a date and a country in a single value, like June 13, 1980 (United States). To tidy the data, we should split this column into a release_date column and a release_country column.
movies <- separate(movies, released, into = c("release_date", "release_country"), sep = " \\(|\\)") %>%
  mutate(release_country = factor(release_country), release_date = mdy(release_date))
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [5729,
5731].
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = mdy(release_date)`.
Caused by warning:
! 10 failed to parse.
head(movies, 10)
name rating genre year
1 The Shining R Drama 1980
2 The Blue Lagoon R Adventure 1980
3 Star Wars: Episode V - The Empire Strikes Back PG Action 1980
4 Airplane! PG Comedy 1980
5 Caddyshack R Comedy 1980
6 Friday the 13th R Horror 1980
7 The Blues Brothers R Action 1980
8 Raging Bull R Biography 1980
9 Superman II PG Action 1980
10 The Long Riders R Biography 1980
release_date release_country score votes director
1 1980-06-13 United States 8.4 927000 Stanley Kubrick
2 1980-07-02 United States 5.8 65000 Randal Kleiser
3 1980-06-20 United States 8.7 1200000 Irvin Kershner
4 1980-07-02 United States 7.7 221000 Jim Abrahams
5 1980-07-25 United States 7.3 108000 Harold Ramis
6 1980-05-09 United States 6.4 123000 Sean S. Cunningham
7 1980-06-20 United States 7.9 188000 John Landis
8 1980-12-19 United States 8.2 330000 Martin Scorsese
9 1981-06-19 United States 6.8 101000 Richard Lester
10 1980-05-16 United States 7.0 10000 Walter Hill
writer star country budget gross
1 Stephen King Jack Nicholson United Kingdom 1.9e+07 46998772
2 Henry De Vere Stacpoole Brooke Shields United States 4.5e+06 58853106
3 Leigh Brackett Mark Hamill United States 1.8e+07 538375067
4 Jim Abrahams Robert Hays United States 3.5e+06 83453539
5 Brian Doyle-Murray Chevy Chase United States 6.0e+06 39846344
6 Victor Miller Betsy Palmer United States 5.5e+05 39754601
7 Dan Aykroyd John Belushi United States 2.7e+07 115229890
8 Jake LaMotta Robert De Niro United States 1.8e+07 23402427
9 Jerry Siegel Gene Hackman United States 5.4e+07 108185706
10 Bill Bryden David Carradine United States 1.0e+07 15795189
company runtime
1 Warner Bros. 146
2 Columbia Pictures 104
3 Lucasfilm 124
4 Paramount Pictures 88
5 Orion Pictures 98
6 Paramount Pictures 95
7 Universal Pictures 133
8 Chartoff-Winkler Productions 129
9 Dovemead Films 127
10 United Artists 100
We note that there are two rows where released is empty, and 10 rows where mdy fails to parse the date because only the year, or nothing at all, is given.
To account for this, we can set the release date of those rows to January 1st of the release year, which is given to us in the column “year”.
name rating genre year
1 The Shining R Drama 1980
2 The Blue Lagoon R Adventure 1980
3 Star Wars: Episode V - The Empire Strikes Back PG Action 1980
4 Airplane! PG Comedy 1980
5 Caddyshack R Comedy 1980
6 Friday the 13th R Horror 1980
7 The Blues Brothers R Action 1980
8 Raging Bull R Biography 1980
9 Superman II PG Action 1980
10 The Long Riders R Biography 1980
release_date release_country score votes director
1 1980-06-13 United States 8.4 927000 Stanley Kubrick
2 1980-07-02 United States 5.8 65000 Randal Kleiser
3 1980-06-20 United States 8.7 1200000 Irvin Kershner
4 1980-07-02 United States 7.7 221000 Jim Abrahams
5 1980-07-25 United States 7.3 108000 Harold Ramis
6 1980-05-09 United States 6.4 123000 Sean S. Cunningham
7 1980-06-20 United States 7.9 188000 John Landis
8 1980-12-19 United States 8.2 330000 Martin Scorsese
9 1981-06-19 United States 6.8 101000 Richard Lester
10 1980-05-16 United States 7.0 10000 Walter Hill
writer star country budget gross
1 Stephen King Jack Nicholson United Kingdom 1.9e+07 46998772
2 Henry De Vere Stacpoole Brooke Shields United States 4.5e+06 58853106
3 Leigh Brackett Mark Hamill United States 1.8e+07 538375067
4 Jim Abrahams Robert Hays United States 3.5e+06 83453539
5 Brian Doyle-Murray Chevy Chase United States 6.0e+06 39846344
6 Victor Miller Betsy Palmer United States 5.5e+05 39754601
7 Dan Aykroyd John Belushi United States 2.7e+07 115229890
8 Jake LaMotta Robert De Niro United States 1.8e+07 23402427
9 Jerry Siegel Gene Hackman United States 5.4e+07 108185706
10 Bill Bryden David Carradine United States 1.0e+07 15795189
company runtime
1 Warner Bros. 146
2 Columbia Pictures 104
3 Lucasfilm 124
4 Paramount Pictures 88
5 Orion Pictures 98
6 Paramount Pictures 95
7 Universal Pictures 133
8 Chartoff-Winkler Productions 129
9 Dovemead Films 127
10 United Artists 100
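The fallback described above, using January 1st of “year” when no full date could be parsed, can be sketched as follows. This is a minimal illustration on a toy table, not the code that produced the output above: movies_demo and its rows are hypothetical, and coalesce() with make_date() is one possible way to fill the gaps.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for `movies`; these rows are not from the dataset.
movies_demo <- tibble(
  name         = c("Film A", "Film B", "Film C"),
  year         = c(1980, 1985, 1990),
  release_date = as.Date(c("1980-06-13", NA, NA))
)

# Keep parsed dates; where release_date is NA, use January 1st of `year`.
movies_filled <- movies_demo %>%
  mutate(release_date = coalesce(release_date, make_date(year, 1, 1)))

print(movies_filled)
```

Note that this only fills release_date. Rows with a missing release_country would still be NA in that column.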
Provide a Narrative
This dataset is a list of 7,668 movies released from 1980 to 2020. For each year, the most popular 200 (with some exceptions) movies from that year are recorded. For each movie, there are 16 attributes recorded:
name: the title of the movie (string)
rating: the rating of the movie (R, PG, etc.) (factor)
genre: the main genre of the movie (factor)
year: year of release (double)
release_date: the date the movie was first released (date)
release_country: the country the movie was first released in (factor)
score: IMDb user rating (double)
votes: number of user votes (double)
director: the director (string)
writer: writer of the movie (string)
star: main actor/actress (string)
country: the country of origin (factor)
budget: the budget of a movie. Some movies don’t have this, so it appears as 0 (double)
gross: the box office revenue of the movie in the US (double)
company: the production company (string)
runtime: duration of the movie (double)
Descriptive Statistics
First, we can calculate the distributions of rating and genre.
print(table(movies$rating))
           Approved         G     NC-17 Not Rated        PG     PG-13         R
       77         1       153        23       283      1252      2112      3697
    TV-14     TV-MA     TV-PG   Unrated         X
        1         9         5        52         3
print(round(prop.table(table(movies$rating)), 3))
           Approved         G     NC-17 Not Rated        PG     PG-13         R
    0.010     0.000     0.020     0.003     0.037     0.163     0.275     0.482
    TV-14     TV-MA     TV-PG   Unrated         X
    0.000     0.001     0.001     0.007     0.000
We see that the most common ratings are “R”, “PG-13”, and “PG”. There are also 77 movies where the rating is the empty string (the unlabeled first column of the table), which we may want to replace with NA. Additionally, “Not Rated” and “Unrated” could likely be merged into a single descriptor.
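As a rough sketch of the cleanup suggested above, the empty strings can be turned into NA and the two "unrated" labels merged before converting to a factor. The ratings_demo table is a hypothetical stand-in for the real data, and keeping “Not Rated” as the merged label is an arbitrary choice.

```r
library(dplyr)

# Hypothetical stand-in for the rating column; values chosen for illustration.
ratings_demo <- tibble(rating = c("R", "", "Not Rated", "Unrated", "PG"))

ratings_clean <- ratings_demo %>%
  mutate(
    rating = na_if(rating, ""),                                  # "" becomes NA
    rating = if_else(rating == "Unrated", "Not Rated", rating),  # merge labels
    rating = factor(rating)
  )

# useNA = "ifany" keeps the NA count visible in the table.
print(table(ratings_clean$rating, useNA = "ifany"))
```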
print(table(movies$genre))
Action Adventure Animation Biography Comedy Crime Drama Family
1705 427 338 443 2245 551 1518 11
Fantasy History Horror Music Musical Mystery Romance Sci-Fi
44 1 322 1 2 20 10 10
Sport Thriller Western
1 16 3
print(round(prop.table(table(movies$genre)), 3))
Action Adventure Animation Biography Comedy Crime Drama Family
0.222 0.056 0.044 0.058 0.293 0.072 0.198 0.001
Fantasy History Horror Music Musical Mystery Romance Sci-Fi
0.006 0.000 0.042 0.000 0.000 0.003 0.001 0.001
Sport Thriller Western
0.000 0.002 0.000
We see that the distribution of main genre heavily favors Comedy, Action, and Drama. History, Music, and Sport are each listed as the main genre of only a single movie. This is likely because, while many movies do include these genres, they are very rarely listed as the primary genre of the movie.
We can see that the earliest years and 2020 do not have a full 200 movies recorded, likely due to a lack of availability of data to scrape.
table(movies$country)
Argentina
3 8
Aruba Australia
1 92
Austria Belgium
5 8
Brazil Canada
6 190
Chile China
2 40
Colombia Czech Republic
1 8
Denmark Federal Republic of Yugoslavia
32 2
Finland France
3 279
Germany Greece
117 2
Hong Kong Hungary
45 6
Iceland India
2 62
Indonesia Iran
2 10
Ireland Israel
43 5
Italy Jamaica
61 1
Japan Kenya
81 1
Lebanon Libya
1 1
Malta Mexico
1 22
Netherlands New Zealand
12 25
Norway Panama
12 1
Philippines Poland
3 4
Portugal Republic of Macedonia
2 1
Romania Russia
1 8
Serbia South Africa
1 8
South Korea Soviet Union
35 2
Spain Sweden
47 25
Switzerland Taiwan
10 7
Thailand Turkey
6 3
United Arab Emirates United Kingdom
2 816
United States Vietnam
5475 2
West Germany Yugoslavia
12 5
print("")
[1] ""
table(movies$release_country)
Argentina Australia
13 48
Austria Bahamas
1 1
Bahrain Belgium
1 15
Brazil Bulgaria
17 2
Cameroon Canada
1 32
China Croatia
11 3
Czech Republic Denmark
1 22
Federal Republic of Yugoslavia Finland
1 2
France Germany
148 46
Greece Hong Kong
12 12
Hungary Iceland
6 6
India Indonesia
31 1
Iran Ireland
4 12
Israel Italy
12 30
Japan Kazakhstan
44 2
Kuwait Latvia
1 1
Lebanon Malaysia
2 1
Mexico Netherlands
8 11
New Zealand Norway
6 12
Philippines Poland
4 7
Portugal Puerto Rico
5 1
Romania Russia
1 18
Singapore South Africa
7 5
South Korea Soviet Union
30 1
Spain Sweden
31 21
Taiwan Thailand
4 4
Turkey Ukraine
5 1
United Arab Emirates United Kingdom
1 197
United States Uruguay
6735 1
Vietnam West Germany
1 6
Yugoslavia
1
We can see that the vast majority of movies both originate from the United States (5,475 of 7,668, about 71%) and were first released there (6,735 of 7,668, about 88%).
mean(movies$score, na.rm = TRUE)
[1] 6.390411
quantile(movies$score, na.rm = TRUE)
0% 25% 50% 75% 100%
1.9 5.8 6.5 7.1 9.3
sd(movies$score, na.rm = TRUE)
[1] 0.9688416
We see that the mean score across all movies was 6.39, which is slightly lower than the median score of 6.5 indicated by the quartiles. The lowest movie score was 1.9 and the highest movie score was 9.3.
mean(movies$votes, na.rm = TRUE)
[1] 88108.5
quantile(movies$votes, na.rm = TRUE)
0% 25% 50% 75% 100%
7 9100 33000 93000 2400000
sd(movies$votes, na.rm = TRUE)
[1] 163323.8
The mean number of votes cast for a movie was about 88 thousand, which is significantly higher than the median of 33 thousand, so the distribution is right-skewed. The fewest votes cast was 7, while the maximum votes cast was 2,400,000. The standard deviation was considerable at 163,324, nearly twice the mean, indicating that the vote counts are widely dispersed.
We see that the average budget of a movie is $35,589,876 while the median budget is significantly lower at $20,500,000. The smallest movie budget was $3,000 while the largest was $356,000,000.
We see that the average gross of a movie is $78,500,541 while the median gross is significantly lower at $20,205,757. The smallest movie gross was $309 while the largest was $2,847,246,203.
We can construct an estimate of the profit as the movie’s gross minus its budget. However, since gross tracks the US domestic gross rather than the global gross, this is more of an estimate than an exact number. We see that the average profit of a movie is $67,065,821 while the median profit is significantly lower at $13,766,118. The biggest movie loss was $158,031,147 while the largest gain was $2,610,246,203.
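The profit estimate described above can be sketched as follows. The table and its figures are made up for illustration, and the summary column names are hypothetical. Rows with a missing budget or gross yield an NA profit and are dropped from the summaries via na.rm = TRUE.

```r
library(dplyr)

# Hypothetical budgets and grosses, not taken from the dataset.
movies_demo <- tibble(
  name   = c("Film A", "Film B", "Film C"),
  budget = c(10e6, 50e6, NA),
  gross  = c(40e6, 20e6, 5e6)
)

# Estimated profit: gross minus budget (NA when either value is missing).
movies_profit <- movies_demo %>%
  mutate(profit = gross - budget)

profit_summary <- movies_profit %>%
  summarise(
    mean_profit   = mean(profit, na.rm = TRUE),
    median_profit = median(profit, na.rm = TRUE),
    biggest_loss  = min(profit, na.rm = TRUE),
    biggest_gain  = max(profit, na.rm = TRUE)
  )

print(profit_summary)
```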
mean(movies$runtime, na.rm = TRUE)
[1] 107.2616
quantile(movies$runtime, na.rm = TRUE)
0% 25% 50% 75% 100%
55 95 104 116 366
sd(movies$runtime, na.rm = TRUE)
[1] 18.58125
The average length of a movie was about 107.3 minutes, while the median was lower at 104 minutes. The shortest movie was 55 minutes, and the longest was 366 minutes.
Potential Research Questions
One potentially interesting research question would be to see how the profit of movies has changed over time. For instance, services like Netflix might be making a trip to the theater less appealing to viewers, so the total box office returns of movies may be going down with time.
We could also try to see how well the score of a movie (its IMDb user rating) correlates with various factors. For instance, do movies with bigger budgets generally have higher scores? How about movies with bigger grosses? How have the scores of the top 200 movies changed over time? Are longer movies generally better than shorter movies?
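One simple starting point for these questions is a correlation matrix between score and the numeric columns. The sketch below uses a hypothetical toy table; on the real data, use = "pairwise.complete.obs" would let each pair of columns skip rows with missing values. A linear correlation is only a first look and says nothing about causation.

```r
# Hypothetical values chosen so the relationships are easy to verify by eye.
scores_demo <- data.frame(
  score   = c(5.0, 6.0, 7.0, 8.0),
  budget  = c(1e6, 2e6, 3e6, 4e6),
  runtime = c(120, 100, 110, 90)
)

# Pearson correlations between every pair of columns.
cor_matrix <- cor(scores_demo, use = "pairwise.complete.obs")

print(round(cor_matrix, 2))
```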
Lastly, we could see how much of a factor “name recognition” plays in the success of a movie. For instance, we could take the top 10 most common directors (or writers or stars) and see how well their movies were rated compared to the rest.
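A minimal sketch of that comparison, assuming "most common" means the directors with the most movies in the dataset. The toy table and the names n_top, top_directors, and comparison are hypothetical, and n_top is set to 2 here only so the toy data has something left in the "rest" group.

```r
library(dplyr)

# Hypothetical directors and scores, not taken from the dataset.
movies_demo <- tibble(
  director = c("A", "A", "A", "B", "B", "C", "D"),
  score    = c(8, 7, 9, 6, 8, 5, 4)
)

n_top <- 2  # the text above suggests 10 for the real data

# Directors with the most movies, most frequent first.
top_directors <- movies_demo %>%
  count(director, sort = TRUE) %>%
  slice_head(n = n_top) %>%
  pull(director)

# Mean score for movies by those directors versus everyone else.
comparison <- movies_demo %>%
  mutate(group = if_else(director %in% top_directors, "top", "rest")) %>%
  group_by(group) %>%
  summarise(mean_score = mean(score, na.rm = TRUE), movies = n())

print(comparison)
```

The same pattern applies to writers or stars by swapping the column passed to count().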