HW1

Import the Dataset

For my dataset, I chose a movies dataset from Kaggle containing 7,668 movies released from 1980 to 2020.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies_raw = read.csv("movies.csv")
head(movies_raw, 10)
                                             name rating     genre year
1                                     The Shining      R     Drama 1980
2                                 The Blue Lagoon      R Adventure 1980
3  Star Wars: Episode V - The Empire Strikes Back     PG    Action 1980
4                                       Airplane!     PG    Comedy 1980
5                                      Caddyshack      R    Comedy 1980
6                                 Friday the 13th      R    Horror 1980
7                              The Blues Brothers      R    Action 1980
8                                     Raging Bull      R Biography 1980
9                                     Superman II     PG    Action 1980
10                                The Long Riders      R Biography 1980
                            released score   votes           director
1      June 13, 1980 (United States)   8.4  927000    Stanley Kubrick
2       July 2, 1980 (United States)   5.8   65000     Randal Kleiser
3      June 20, 1980 (United States)   8.7 1200000     Irvin Kershner
4       July 2, 1980 (United States)   7.7  221000       Jim Abrahams
5      July 25, 1980 (United States)   7.3  108000       Harold Ramis
6        May 9, 1980 (United States)   6.4  123000 Sean S. Cunningham
7      June 20, 1980 (United States)   7.9  188000        John Landis
8  December 19, 1980 (United States)   8.2  330000    Martin Scorsese
9      June 19, 1981 (United States)   6.8  101000     Richard Lester
10      May 16, 1980 (United States)   7.0   10000        Walter Hill
                    writer            star        country  budget     gross
1             Stephen King  Jack Nicholson United Kingdom 1.9e+07  46998772
2  Henry De Vere Stacpoole  Brooke Shields  United States 4.5e+06  58853106
3           Leigh Brackett     Mark Hamill  United States 1.8e+07 538375067
4             Jim Abrahams     Robert Hays  United States 3.5e+06  83453539
5       Brian Doyle-Murray     Chevy Chase  United States 6.0e+06  39846344
6            Victor Miller    Betsy Palmer  United States 5.5e+05  39754601
7              Dan Aykroyd    John Belushi  United States 2.7e+07 115229890
8             Jake LaMotta  Robert De Niro  United States 1.8e+07  23402427
9             Jerry Siegel    Gene Hackman  United States 5.4e+07 108185706
10             Bill Bryden David Carradine  United States 1.0e+07  15795189
                        company runtime
1                  Warner Bros.     146
2             Columbia Pictures     104
3                     Lucasfilm     124
4            Paramount Pictures      88
5                Orion Pictures      98
6            Paramount Pictures      95
7            Universal Pictures     133
8  Chartoff-Winkler Productions     129
9                Dovemead Films     127
10               United Artists     100

Clean the Data

The dataset is mostly clean, but a few columns could be represented in a better format. Rating, genre, and country should all be factors, since they take values from a fixed set of categories rather than a continuous range.

movies <- mutate(movies_raw, rating = factor(rating),
                 genre = factor(genre),
                 country = factor(country))
head(movies, 10)
                                             name rating     genre year
1                                     The Shining      R     Drama 1980
2                                 The Blue Lagoon      R Adventure 1980
3  Star Wars: Episode V - The Empire Strikes Back     PG    Action 1980
4                                       Airplane!     PG    Comedy 1980
5                                      Caddyshack      R    Comedy 1980
6                                 Friday the 13th      R    Horror 1980
7                              The Blues Brothers      R    Action 1980
8                                     Raging Bull      R Biography 1980
9                                     Superman II     PG    Action 1980
10                                The Long Riders      R Biography 1980
                            released score   votes           director
1      June 13, 1980 (United States)   8.4  927000    Stanley Kubrick
2       July 2, 1980 (United States)   5.8   65000     Randal Kleiser
3      June 20, 1980 (United States)   8.7 1200000     Irvin Kershner
4       July 2, 1980 (United States)   7.7  221000       Jim Abrahams
5      July 25, 1980 (United States)   7.3  108000       Harold Ramis
6        May 9, 1980 (United States)   6.4  123000 Sean S. Cunningham
7      June 20, 1980 (United States)   7.9  188000        John Landis
8  December 19, 1980 (United States)   8.2  330000    Martin Scorsese
9      June 19, 1981 (United States)   6.8  101000     Richard Lester
10      May 16, 1980 (United States)   7.0   10000        Walter Hill
                    writer            star        country  budget     gross
1             Stephen King  Jack Nicholson United Kingdom 1.9e+07  46998772
2  Henry De Vere Stacpoole  Brooke Shields  United States 4.5e+06  58853106
3           Leigh Brackett     Mark Hamill  United States 1.8e+07 538375067
4             Jim Abrahams     Robert Hays  United States 3.5e+06  83453539
5       Brian Doyle-Murray     Chevy Chase  United States 6.0e+06  39846344
6            Victor Miller    Betsy Palmer  United States 5.5e+05  39754601
7              Dan Aykroyd    John Belushi  United States 2.7e+07 115229890
8             Jake LaMotta  Robert De Niro  United States 1.8e+07  23402427
9             Jerry Siegel    Gene Hackman  United States 5.4e+07 108185706
10             Bill Bryden David Carradine  United States 1.0e+07  15795189
                        company runtime
1                  Warner Bros.     146
2             Columbia Pictures     104
3                     Lucasfilm     124
4            Paramount Pictures      88
5                Orion Pictures      98
6            Paramount Pictures      95
7            Universal Pictures     133
8  Chartoff-Winkler Productions     129
9                Dovemead Films     127
10               United Artists     100
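The same conversion can be written more compactly with `across()`, which applies one function to several columns at once. This is only a sketch on a toy data frame (the `toy` and `toy_clean` names and their values are illustrative, not taken from the dataset):

```r
library(dplyr)

# Toy stand-in for movies_raw; the values are illustrative only.
toy <- data.frame(
  name    = c("A", "B", "C"),
  rating  = c("R", "PG", "R"),
  genre   = c("Drama", "Comedy", "Drama"),
  country = c("United States", "United Kingdom", "United States")
)

# Convert every categorical column to a factor in one step.
toy_clean <- toy %>%
  mutate(across(c(rating, genre, country), factor))

str(toy_clean)
```

Listing the columns once inside `across()` makes it harder to convert the same column twice or to leave one out by accident.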

Next, the column “released” currently has data which is a combination of a date and a country, like June 13, 1980 (United States). In order to tidy the data, we should split this column into a release_date column and a release_country column.

movies <- movies %>%
  separate(released, into = c("release_date", "release_country"), sep = " \\(|\\)") %>%
  mutate(release_country = factor(release_country),
         release_date = mdy(release_date))
Warning: Expected 2 pieces. Additional pieces discarded in 7666 rows [1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [5729,
5731].
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = mdy(release_date)`.
Caused by warning:
!  10 failed to parse.
head(movies, 10)
                                             name rating     genre year
1                                     The Shining      R     Drama 1980
2                                 The Blue Lagoon      R Adventure 1980
3  Star Wars: Episode V - The Empire Strikes Back     PG    Action 1980
4                                       Airplane!     PG    Comedy 1980
5                                      Caddyshack      R    Comedy 1980
6                                 Friday the 13th      R    Horror 1980
7                              The Blues Brothers      R    Action 1980
8                                     Raging Bull      R Biography 1980
9                                     Superman II     PG    Action 1980
10                                The Long Riders      R Biography 1980
   release_date release_country score   votes           director
1    1980-06-13   United States   8.4  927000    Stanley Kubrick
2    1980-07-02   United States   5.8   65000     Randal Kleiser
3    1980-06-20   United States   8.7 1200000     Irvin Kershner
4    1980-07-02   United States   7.7  221000       Jim Abrahams
5    1980-07-25   United States   7.3  108000       Harold Ramis
6    1980-05-09   United States   6.4  123000 Sean S. Cunningham
7    1980-06-20   United States   7.9  188000        John Landis
8    1980-12-19   United States   8.2  330000    Martin Scorsese
9    1981-06-19   United States   6.8  101000     Richard Lester
10   1980-05-16   United States   7.0   10000        Walter Hill
                    writer            star        country  budget     gross
1             Stephen King  Jack Nicholson United Kingdom 1.9e+07  46998772
2  Henry De Vere Stacpoole  Brooke Shields  United States 4.5e+06  58853106
3           Leigh Brackett     Mark Hamill  United States 1.8e+07 538375067
4             Jim Abrahams     Robert Hays  United States 3.5e+06  83453539
5       Brian Doyle-Murray     Chevy Chase  United States 6.0e+06  39846344
6            Victor Miller    Betsy Palmer  United States 5.5e+05  39754601
7              Dan Aykroyd    John Belushi  United States 2.7e+07 115229890
8             Jake LaMotta  Robert De Niro  United States 1.8e+07  23402427
9             Jerry Siegel    Gene Hackman  United States 5.4e+07 108185706
10             Bill Bryden David Carradine  United States 1.0e+07  15795189
                        company runtime
1                  Warner Bros.     146
2             Columbia Pictures     104
3                     Lucasfilm     124
4            Paramount Pictures      88
5                Orion Pictures      98
6            Paramount Pictures      95
7            Universal Pictures     133
8  Chartoff-Winkler Productions     129
9                Dovemead Films     127
10               United Artists     100
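The "Additional pieces discarded" warning appears because the separator matches both parentheses, so each value is split into three pieces (the last one empty). One possible alternative, sketched here on a toy data frame with hypothetical names (`toy`, `parts`), is `tidyr::extract()`, which captures the two parts with a single regular expression and returns `NA` for rows that do not match:

```r
library(tidyr)

# Toy values in the same shape as the "released" column.
toy <- data.frame(
  released = c("June 13, 1980 (United States)", "1982 (Japan)", "")
)

# Group 1: everything before " (", group 2: the text inside the parentheses.
parts <- extract(
  toy, released,
  into  = c("release_date", "release_country"),
  regex = "^([^(]*) \\((.*)\\)$"
)

parts
```

Rows with an empty `released` value do not match the pattern, so both new columns are `NA` for them.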

We note that there are two rows where released is empty and 10 rows where mdy fails to parse the date because only the year is given:

movies_raw[is.na(movies$release_date), "released", drop = FALSE]
                 released
202  1981 (United States)
313          1982 (Japan)
787         1985 (Taiwan)
801  1985 (United States)
1174 1987 (United States)
1821 1990 (United States)
1826        1990 (Canada)
2817          1995 (Iran)
4188 2019 (United States)
5729                     
5731                     
6414 2013 (United States)

To account for this, we can fall back to January 1st of the release year, which is given to us in the column “year”.

movies$release_date <- if_else(is.na(movies$release_date), mdy(paste("January 1,", movies$year)), movies$release_date)
head(movies, 10)
                                             name rating     genre year
1                                     The Shining      R     Drama 1980
2                                 The Blue Lagoon      R Adventure 1980
3  Star Wars: Episode V - The Empire Strikes Back     PG    Action 1980
4                                       Airplane!     PG    Comedy 1980
5                                      Caddyshack      R    Comedy 1980
6                                 Friday the 13th      R    Horror 1980
7                              The Blues Brothers      R    Action 1980
8                                     Raging Bull      R Biography 1980
9                                     Superman II     PG    Action 1980
10                                The Long Riders      R Biography 1980
   release_date release_country score   votes           director
1    1980-06-13   United States   8.4  927000    Stanley Kubrick
2    1980-07-02   United States   5.8   65000     Randal Kleiser
3    1980-06-20   United States   8.7 1200000     Irvin Kershner
4    1980-07-02   United States   7.7  221000       Jim Abrahams
5    1980-07-25   United States   7.3  108000       Harold Ramis
6    1980-05-09   United States   6.4  123000 Sean S. Cunningham
7    1980-06-20   United States   7.9  188000        John Landis
8    1980-12-19   United States   8.2  330000    Martin Scorsese
9    1981-06-19   United States   6.8  101000     Richard Lester
10   1980-05-16   United States   7.0   10000        Walter Hill
                    writer            star        country  budget     gross
1             Stephen King  Jack Nicholson United Kingdom 1.9e+07  46998772
2  Henry De Vere Stacpoole  Brooke Shields  United States 4.5e+06  58853106
3           Leigh Brackett     Mark Hamill  United States 1.8e+07 538375067
4             Jim Abrahams     Robert Hays  United States 3.5e+06  83453539
5       Brian Doyle-Murray     Chevy Chase  United States 6.0e+06  39846344
6            Victor Miller    Betsy Palmer  United States 5.5e+05  39754601
7              Dan Aykroyd    John Belushi  United States 2.7e+07 115229890
8             Jake LaMotta  Robert De Niro  United States 1.8e+07  23402427
9             Jerry Siegel    Gene Hackman  United States 5.4e+07 108185706
10             Bill Bryden David Carradine  United States 1.0e+07  15795189
                        company runtime
1                  Warner Bros.     146
2             Columbia Pictures     104
3                     Lucasfilm     124
4            Paramount Pictures      88
5                Orion Pictures      98
6            Paramount Pictures      95
7            Universal Pictures     133
8  Chartoff-Winkler Productions     129
9                Dovemead Films     127
10               United Artists     100
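A variant of this fallback, sketched on toy data, builds the date with `make_date()` and fills the gaps with `coalesce()`. The `date_imputed` column is a hypothetical addition that records which rows received a placeholder date, so they can be excluded from any analysis that depends on the exact day or month:

```r
library(dplyr)
library(lubridate)

# Toy rows: one with a parsed date, one where parsing failed.
toy <- data.frame(
  year         = c(1980, 1982),
  release_date = as.Date(c("1980-06-13", NA))
)

toy <- toy %>%
  mutate(
    # Flag the rows before overwriting the missing dates.
    date_imputed = is.na(release_date),
    # Use January 1 of the listed year wherever the date is missing.
    release_date = coalesce(release_date, make_date(year, 1, 1))
  )

toy
```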

Provide a Narrative

This dataset is a list of 7,668 movies released from 1980 to 2020. For each year, the 200 most popular movies (with some exceptions) from that year are recorded. For each movie, there are 16 attributes recorded:

  • name: the title of the movie (string)

  • rating: the rating of the movie (R, PG, etc.) (factor)

  • genre: the main genre of the movie (factor)

  • year: year of release (double)

  • release_date: the date the movie was first released (date)

  • release_country: the country the movie was first released in (factor)

  • score: IMDb user rating (double)

  • votes: number of user votes (double)

  • director: the director (string)

  • writer: writer of the movie (string)

  • star: main actor/actress (string)

  • country: the country of origin (factor)

  • budget: the budget of a movie. Some movies don’t have this, so it appears as NA (double)

  • gross: the box office revenue of the movie in the US (double)

  • company: the production company (string)

  • runtime: duration of the movie (double)

Descriptive Statistics

First, we can calculate the distributions of rating and genre.

print(table(movies$rating))

           Approved         G     NC-17 Not Rated        PG     PG-13         R 
       77         1       153        23       283      1252      2112      3697 
    TV-14     TV-MA     TV-PG   Unrated         X 
        1         9         5        52         3 
print(round(prop.table(table(movies$rating)), 3))

           Approved         G     NC-17 Not Rated        PG     PG-13         R 
    0.010     0.000     0.020     0.003     0.037     0.163     0.275     0.482 
    TV-14     TV-MA     TV-PG   Unrated         X 
    0.000     0.001     0.001     0.007     0.000 

We see that the most common ratings are “R”, “PG-13”, and “PG”. We also see that there are 77 movies where the rating is the empty string “”, which we may want to replace with NA. Additionally, “Not Rated” and “Unrated” could likely be merged into a single level.
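The two cleanup steps suggested above could look roughly like the following. This is a sketch on a toy vector, and the choice of “Unrated” as the merged label is an assumption:

```r
library(dplyr)
library(forcats)

# Toy ratings, including an empty string and both "unrated" spellings.
rating <- c("R", "", "Not Rated", "Unrated", "PG")

cleaned <- rating %>%
  na_if("") %>%                                       # empty string becomes NA
  factor() %>%
  fct_collapse(Unrated = c("Not Rated", "Unrated"))   # merge the two levels

table(cleaned, useNA = "ifany")
```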

print(table(movies$genre))

   Action Adventure Animation Biography    Comedy     Crime     Drama    Family 
     1705       427       338       443      2245       551      1518        11 
  Fantasy   History    Horror     Music   Musical   Mystery   Romance    Sci-Fi 
       44         1       322         1         2        20        10        10 
    Sport  Thriller   Western 
        1        16         3 
print(round(prop.table(table(movies$genre)), 3))

   Action Adventure Animation Biography    Comedy     Crime     Drama    Family 
    0.222     0.056     0.044     0.058     0.293     0.072     0.198     0.001 
  Fantasy   History    Horror     Music   Musical   Mystery   Romance    Sci-Fi 
    0.006     0.000     0.042     0.000     0.000     0.003     0.001     0.001 
    Sport  Thriller   Western 
    0.000     0.002     0.000 

We see that the distribution of main genre heavily favors Comedy, Action, and Drama. History, Music, and Sport are each listed as the main genre of only a single movie. This is likely because, while many movies do include these genres, they are very rarely listed as the primary genre of a movie.

table(movies$year)

1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 
  92  113  126  144  168  200  200  200  200  200  200  200  200  200  200  200 
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
 200  200  200  200  200  200  200  200  200  200  200  200  200  200  200  200 
2012 2013 2014 2015 2016 2017 2018 2019 2020 
 200  200  200  200  200  200  200  200   25 

We can see that the earliest years and 2020 do not have a full 200 movies recorded, likely due to a lack of availability of data to scrape.
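Rather than reading the incomplete years off the table by eye, they can be selected directly. A minimal sketch on a toy vector, where the per-year target of 3 stands in for the 200 used by the dataset:

```r
# Toy years: 1980 and 2020 fall short of the per-year target.
year   <- c(rep(1980, 2), rep(1985, 3), rep(2020, 1))
target <- 3

counts <- table(year)

# Keep only the years with fewer movies than the target.
short_years <- counts[counts < target]
names(short_years)
```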

table(movies$country)

                                                    Argentina 
                             3                              8 
                         Aruba                      Australia 
                             1                             92 
                       Austria                        Belgium 
                             5                              8 
                        Brazil                         Canada 
                             6                            190 
                         Chile                          China 
                             2                             40 
                      Colombia                 Czech Republic 
                             1                              8 
                       Denmark Federal Republic of Yugoslavia 
                            32                              2 
                       Finland                         France 
                             3                            279 
                       Germany                         Greece 
                           117                              2 
                     Hong Kong                        Hungary 
                            45                              6 
                       Iceland                          India 
                             2                             62 
                     Indonesia                           Iran 
                             2                             10 
                       Ireland                         Israel 
                            43                              5 
                         Italy                        Jamaica 
                            61                              1 
                         Japan                          Kenya 
                            81                              1 
                       Lebanon                          Libya 
                             1                              1 
                         Malta                         Mexico 
                             1                             22 
                   Netherlands                    New Zealand 
                            12                             25 
                        Norway                         Panama 
                            12                              1 
                   Philippines                         Poland 
                             3                              4 
                      Portugal          Republic of Macedonia 
                             2                              1 
                       Romania                         Russia 
                             1                              8 
                        Serbia                   South Africa 
                             1                              8 
                   South Korea                   Soviet Union 
                            35                              2 
                         Spain                         Sweden 
                            47                             25 
                   Switzerland                         Taiwan 
                            10                              7 
                      Thailand                         Turkey 
                             6                              3 
          United Arab Emirates                 United Kingdom 
                             2                            816 
                 United States                        Vietnam 
                          5475                              2 
                  West Germany                     Yugoslavia 
                            12                              5 
print("")
[1] ""
table(movies$release_country)

                     Argentina                      Australia 
                            13                             48 
                       Austria                        Bahamas 
                             1                              1 
                       Bahrain                        Belgium 
                             1                             15 
                        Brazil                       Bulgaria 
                            17                              2 
                      Cameroon                         Canada 
                             1                             32 
                         China                        Croatia 
                            11                              3 
                Czech Republic                        Denmark 
                             1                             22 
Federal Republic of Yugoslavia                        Finland 
                             1                              2 
                        France                        Germany 
                           148                             46 
                        Greece                      Hong Kong 
                            12                             12 
                       Hungary                        Iceland 
                             6                              6 
                         India                      Indonesia 
                            31                              1 
                          Iran                        Ireland 
                             4                             12 
                        Israel                          Italy 
                            12                             30 
                         Japan                     Kazakhstan 
                            44                              2 
                        Kuwait                         Latvia 
                             1                              1 
                       Lebanon                       Malaysia 
                             2                              1 
                        Mexico                    Netherlands 
                             8                             11 
                   New Zealand                         Norway 
                             6                             12 
                   Philippines                         Poland 
                             4                              7 
                      Portugal                    Puerto Rico 
                             5                              1 
                       Romania                         Russia 
                             1                             18 
                     Singapore                   South Africa 
                             7                              5 
                   South Korea                   Soviet Union 
                            30                              1 
                         Spain                         Sweden 
                            31                             21 
                        Taiwan                       Thailand 
                             4                              4 
                        Turkey                        Ukraine 
                             5                              1 
          United Arab Emirates                 United Kingdom 
                             1                            197 
                 United States                        Uruguay 
                          6735                              1 
                       Vietnam                   West Germany 
                             1                              6 
                    Yugoslavia 
                             1 

We can see that the vast majority of movies both originate from the United States (5,475 of 7,668, about 71%) and are initially released there (6,735, about 88%).

mean(movies$score, na.rm = TRUE)
[1] 6.390411
quantile(movies$score, na.rm = TRUE)
  0%  25%  50%  75% 100% 
 1.9  5.8  6.5  7.1  9.3 
sd(movies$score, na.rm = TRUE)
[1] 0.9688416

We see that the mean score across all movies was 6.39, which is slightly lower than the median score of 6.5 given by the quartiles. The lowest movie score was 1.9 and the highest movie score was 9.3.

mean(movies$votes, na.rm = TRUE)
[1] 88108.5
quantile(movies$votes, na.rm = TRUE)
     0%     25%     50%     75%    100% 
      7    9100   33000   93000 2400000 
sd(movies$votes, na.rm = TRUE)
[1] 163323.8

The mean number of votes cast for a movie was about 88 thousand, which is substantially higher than the median of 33 thousand and indicates a right-skewed distribution. The fewest votes cast was 7, while the most was 2,400,000. The standard deviation of about 163,324 is nearly twice the mean, so the vote counts are widely dispersed.
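Since the same mean, quantile, and standard deviation calls are repeated for each numeric column, they could be wrapped in a small helper. `describe()` is a hypothetical name, and the toy vector reuses the five vote quantiles printed above plus one `NA`, so its mean and standard deviation do not match the full dataset:

```r
# Summarise a numeric vector after dropping missing values.
describe <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), median = median(x),
    sd = sd(x), min = min(x), max = max(x))
}

# Toy input: the five vote quantiles from above, plus a missing value.
votes <- c(7, 9100, 33000, 93000, 2400000, NA)
stats <- describe(votes)

# A mean well above the median is a quick sign of right skew.
stats[["mean"]] > stats[["median"]]
```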

length(unique(movies$director))
[1] 2949
string_counts <- table(movies$director)
string_counts[which.max(string_counts)]
Woody Allen 
         38 

We see that 2949 unique directors are captured by the dataset, and the most common director was Woody Allen with a total of 38 movies.

length(unique(movies$writer))
[1] 4536
string_counts <- table(movies$writer)
string_counts[which.max(string_counts)]
Woody Allen 
         37 

We see that 4536 unique writers are captured by the dataset, and the most common writer was again Woody Allen with a total of 37 movies.

length(unique(movies$star))
[1] 2815
string_counts <- table(movies$star)
string_counts[which.max(string_counts)]
Nicolas Cage 
          43 

We see that 2815 unique stars are captured by the dataset, and the most common star was Nicolas Cage with a total of 43 movies.
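`which.max()` returns only the first maximum, so ties and runners-up stay hidden. A possible alternative, sketched on toy data with placeholder names, is `count()` with `sort = TRUE`, which gives the full ranking:

```r
library(dplyr)

# Toy data; the director names are placeholders.
toy <- data.frame(director = c("A", "B", "A", "C", "B", "A"))

top_directors <- toy %>%
  count(director, sort = TRUE) %>%   # one row per director, most frequent first
  slice_head(n = 2)                  # keep the top two

top_directors
```

On the real data it may be worth checking whether empty strings appear among the names, since they would be counted like any other value.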

options(scipen = 999)
mean(movies$budget, na.rm = TRUE)
[1] 35589876
quantile(movies$budget, na.rm = TRUE)
       0%       25%       50%       75%      100% 
     3000  10000000  20500000  45000000 356000000 
sd(movies$budget, na.rm = TRUE)
[1] 41457297

We see that the average budget of a movie is $35,589,876 while the median budget is substantially lower at $20,500,000. The smallest movie budget was $3,000 while the largest was $356,000,000.

mean(movies$gross, na.rm = TRUE)
[1] 78500541
quantile(movies$gross, na.rm = TRUE)
        0%        25%        50%        75%       100% 
       309    4532056   20205757   76016692 2847246203 
sd(movies$gross, na.rm = TRUE)
[1] 165725124

We see that the average gross of a movie is $78,500,541 while the median gross is significantly lower at $20,205,757. The smallest movie gross was $309 while the largest was $2,847,246,203.

profit_est <- movies$gross - movies$budget
mean(profit_est, na.rm = TRUE)
[1] 67065821
quantile(profit_est, na.rm = TRUE)
        0%        25%        50%        75%       100% 
-158031147   -3177509   13766118   70175840 2610246203 
sd(profit_est, na.rm = TRUE)
[1] 158818097

We can construct an estimate of the profit as the movie’s gross minus its budget. However, since gross tracks the US domestic gross rather than the global gross, this is more of an estimate than an exact number. We see that the average profit of a movie is $67,065,821 while the median profit is significantly lower at $13,766,118. The biggest movie loss was $158,031,147 while the largest gain was $2,610,246,203.
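Storing the estimate as a column keeps it aligned with the other attributes for later grouping or plotting. In this sketch the first and third rows reuse the budget and gross of the first and third movies printed earlier, while the missing budget in the second row is made up to show how `NA` propagates:

```r
library(dplyr)

# Rows 1 and 3 mirror the head() output; the NA budget is illustrative.
toy <- data.frame(
  budget = c(19e6, NA, 18e6),
  gross  = c(46998772, 58853106, 538375067)
)

toy <- toy %>%
  mutate(profit_est = gross - budget)   # NA whenever either input is NA

# Number of movies for which no estimate is available.
sum(is.na(toy$profit_est))
```

Because the estimate is `NA` whenever budget or gross is missing, the profit summary is computed over fewer movies than the separate gross and budget summaries.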

mean(movies$runtime, na.rm = TRUE)
[1] 107.2616
quantile(movies$runtime, na.rm = TRUE)
  0%  25%  50%  75% 100% 
  55   95  104  116  366 
sd(movies$runtime, na.rm = TRUE)
[1] 18.58125

The average length of a movie was about 107.3 minutes, while the median was lower at 104 minutes. The shortest movie was 55 minutes, and the longest was 366 minutes.

Potential Research Questions

  1. One potentially interesting research question would be to see how the profit of movies has changed over time. For instance, services like Netflix might be making seeing a movie less appealing to viewers, so the total box office returns of movies may be going down with time.
  2. We could also try to see how well the score (user rating) of a movie correlates with various factors. For instance, do movies with bigger budgets generally have higher scores? How about movies with bigger grosses? How have the scores of the top 200 movies changed over time? Are longer movies generally rated better than shorter movies?
  3. Lastly, we could see how much of a factor “name recognition” plays in the success of a movie. For instance, we could take the top 10 most common directors (or writers or stars) and see how well their movies were rated compared to the rest.
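As a starting point for the first two questions, a per-year summary and a simple correlation might look like the following. This is a rough sketch on toy values: the column names follow the dataset, but the numbers are illustrative and the median is only one reasonable choice of summary:

```r
library(dplyr)

# Toy data with the same column names as the movies dataset.
toy <- data.frame(
  year   = c(1980, 1980, 1981, 1981),
  score  = c(8.4, 5.8, 6.8, 7.0),
  budget = c(19e6, 4.5e6, 54e6, NA),
  gross  = c(46998772, 58853106, 108185706, 15795189)
)

# Question 1: how does the typical estimated profit change by year?
by_year <- toy %>%
  mutate(profit_est = gross - budget) %>%
  group_by(year) %>%
  summarise(median_profit = median(profit_est, na.rm = TRUE), n = n())

# Question 2: linear correlation between budget and score,
# using only rows where both values are present.
budget_score_cor <- cor(toy$budget, toy$score, use = "complete.obs")

by_year
budget_score_cor
```

The median is used here because the gross and budget distributions are heavily right-skewed, so a few very large movies would dominate a yearly mean.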