Data Description

The movies data set has 44,010 rows describing the amount of explicit content (drugs, language, sex, nudity, and violence) found in 1,467 movies released since 1958. Each movie is represented by 30 rows, one for each movie and tag_name combination (1,467 × 30 = 44,010).

The relevant variables in the data set are:

## # A tibble: 44,010 × 12
##    imdb_id   name    title_main title_subscript  year rating run_time studio   
##    <chr>     <chr>   <chr>      <chr>           <int> <chr>     <int> <chr>    
##  1 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  2 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  3 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  4 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  5 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  6 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  7 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  8 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  9 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
## 10 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
## # ℹ 44,000 more rows
## # ℹ 4 more variables: category <chr>, tag_name <chr>, occurrences <int>,
## #   occur_duration <int>

Question 1: Removing unwanted rows and columns

Create a data set named movies2 that has the following rows:

Additionally, movies2 should only have the imdb_id, name, year, rating, run_time, studio, category, tag_name, occurrences, and occur_duration columns. Display the movies2 data set using the tibble() function.

## # A tibble: 17,124 × 10
##    imdb_id   name      year rating run_time studio category tag_name occurrences
##    <chr>     <chr>    <int> <chr>     <int> <chr>  <chr>    <chr>          <int>
##  1 tt0087231 The Fal…  1985 R          7920 Orion… violence disturb…           0
##  2 tt0087231 The Fal…  1985 R          7920 Orion… language sexual_…           3
##  3 tt0087231 The Fal…  1985 R          7920 Orion… violence graphic            1
##  4 tt0087231 The Fal…  1985 R          7920 Orion… immodes… nudity_…          17
##  5 tt0087231 The Fal…  1985 R          7920 Orion… language profani…          42
##  6 tt0087231 The Fal…  1985 R          7920 Orion… immodes… nudity_…           1
##  7 tt0087231 The Fal…  1985 R          7920 Orion… language racial_…           1
##  8 tt0087231 The Fal…  1985 R          7920 Orion… sexual   sexual_…           0
##  9 tt0087231 The Fal…  1985 R          7920 Orion… drugs    drugs_i…           9
## 10 tt0087231 The Fal…  1985 R          7920 Orion… violence gore               0
## # ℹ 17,114 more rows
## # ℹ 1 more variable: occur_duration <int>
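
As a minimal sketch of the column-selection step only, the snippet below uses a small made-up data frame (toy_movies, with illustrative values that are not taken from the real data) and assumes dplyr is installed. It does not cover the row conditions, and it is one possible approach rather than the expected solution.

```r
library(dplyr)

# Hypothetical stand-in for the movies data; all values are invented
toy_movies <- tibble(
  imdb_id         = c("tt0000001", "tt0000001"),
  name            = c("Example Movie", "Example Movie"),
  title_main      = c("Example Movie", "Example Movie"),
  title_subscript = c("", ""),
  year            = c(1990L, 1990L),
  rating          = c("PG", "PG"),
  tag_name        = c("gore", "graphic"),
  occurrences     = c(0L, 2L)
)

# select() keeps only the named columns, in the order they are listed
toy_movies2 <- toy_movies %>%
  select(imdb_id, name, year, rating, tag_name, occurrences)

# Display the result as a tibble
tibble(toy_movies2)
```

The title_main and title_subscript columns are dropped because they are not named in select().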

Question 2: Changing some of the columns

If you’re unable to complete question 1, you can use the “movies q2.csv” data set in Brightspace.

Change the run_time and category columns in the movies2 data set as follows:

Make sure to use the appropriate dplyr verb(s)!

Show the 10 rows with the most occurrences. Display only the name of the movie, run_time, category, tag_name, and occurrences (the movies2 data set itself should still have all 10 columns).

##                       name run_time category  tag_name occurrences
## 1               Uncut Gems      136 language profanity         883
## 2  The Wolf of Wall Street      180 language profanity         743
## 3          This Is the End      107 language profanity         586
## 4                   Casino      179 language profanity         541
## 5             End of Watch      109 language profanity         502
## 6                   8 Mile      111 language profanity         479
## 7       They Cloned Tyrone      122 language profanity         478
## 8                   Cherry      140 language profanity         461
## 9           Reservoir Dogs       99 language profanity         457
## 10         Malcolm & Marie      106 language profanity         455

What tag do all 10 movies with the most occurrences have?

All 10 rows with the most occurrences have the profanity tag.
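
One way to pull the rows with the most occurrences without changing the underlying data set is sketched below. It uses a made-up data frame (toy_tags, illustrative values only) and keeps the top 2 rows instead of 10. slice_max() is one option; arrange(desc(occurrences)) followed by head() would be another.

```r
library(dplyr)

# Hypothetical data; the values are invented for illustration
toy_tags <- tibble(
  name        = c("Movie A", "Movie B", "Movie C"),
  tag_name    = c("gore", "profanity", "profanity"),
  occurrences = c(5L, 12L, 7L)
)

# Keep the 2 rows with the largest occurrences, then show a subset of columns.
# The result is stored under a new name, so toy_tags keeps all of its columns.
top_rows <- toy_tags %>%
  slice_max(occurrences, n = 2) %>%
  select(name, occurrences)

top_rows
```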

Question 3: Combining tags in the same category

If you were unable to complete question 2, you can use the “movies q3.csv” data set for this question.

For this question, we want the data to include only the category and ignore the tags. That is, if a movie had 3 scenes tagged “graphic” and 2 scenes tagged “gore”, we want to combine those scenes, since both tags fall under the “violence” category, so that the movie has 5 violent scenes.
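
The “graphic” plus “gore” example can be sketched on a tiny made-up data frame (toy_tags, a hypothetical name with invented values). Grouping by movie and category and then summing is one possible way to combine the tags, not necessarily the intended one.

```r
library(dplyr)

# Hypothetical rows for a single movie: 3 graphic scenes, 2 gore scenes
toy_tags <- tibble(
  imdb_id     = c("tt0000001", "tt0000001", "tt0000001"),
  category    = c("violence", "violence", "language"),
  tag_name    = c("graphic", "gore", "profanity"),
  occurrences = c(3L, 2L, 10L)
)

# One row per movie and category; tag-level counts are added together
toy_by_category <- toy_tags %>%
  group_by(imdb_id, category) %>%
  summarise(occurrences = sum(occurrences), .groups = "drop")

toy_by_category
```

The violence row ends up with 3 + 2 = 5 occurrences, and the tag_name column disappears because it is neither a grouping variable nor a summary.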

Using the movies2 data set, create a data frame called movies_summary that has the following columns:

  1. imdb_id: The IMDB identifier of the movie
  2. name: the name of the movie
  3. rating: The MPAA rating of the movie (PG/PG-13/R)
  4. run_time: The length of the movie (in minutes)
  5. category: The type of explicit content (language, violence, sex/nudity, drugs)
  6. occurrences: The total number of scenes in the movie of the category type
  7. occur_duration: The total length (in seconds) of all scenes in the movie of the category type
  8. occurred: Either the value “yes” if there is at least 1 scene of that content type in the movie and “no” if there are 0 explicit scenes of that type in the movie

Hint: You should be creating the columns in the order specified using 2 different dplyr verbs
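
As an illustration of the yes/no logic described for the occurred column, the sketch below assumes the per-category totals already exist in a made-up data frame (toy_summary, invented values). Using if_else() inside mutate() is one possible choice.

```r
library(dplyr)

# Hypothetical per-category totals for one movie
toy_summary <- tibble(
  category    = c("violence", "drugs"),
  occurrences = c(5L, 0L)
)

# "yes" when there is at least 1 scene of that type, otherwise "no"
toy_summary <- toy_summary %>%
  mutate(occurred = if_else(occurrences >= 1, "yes", "no"))

toy_summary
```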

Use tibble() to display the first 10 rows

## # A tibble: 5,708 × 8
##    imdb_id   name   rating run_time category occurrences occur_duration occurred
##    <chr>     <chr>  <chr>     <dbl> <chr>          <int>          <int> <chr>   
##  1 tt0087231 The F… R           132 violence           1              6 yes     
##  2 tt0087231 The F… R           132 language          46             48 yes     
##  3 tt0087231 The F… R           132 sex/nud…          18            132 yes     
##  4 tt0087231 The F… R           132 drugs              9             24 yes     
##  5 tt0088763 Back … PG          116 violence           2              0 yes     
##  6 tt0088763 Back … PG          116 language          60             66 yes     
##  7 tt0088763 Back … PG          116 sex/nud…           8            120 yes     
##  8 tt0088763 Back … PG          116 drugs              6             48 yes     
##  9 tt0088846 Brazil R           143 violence          22            174 yes     
## 10 tt0088846 Brazil R           143 language          45             66 yes     
## # ℹ 5,698 more rows

Use the code chunk below to check that you did it successfully. Each movie should occur exactly 4 times (once for each category), and the result should have 1,427 rows.

## # A tibble: 1,427 × 3
##    imdb_id   name                           n
##    <chr>     <chr>                      <int>
##  1 tt0087231 The Falcon and the Snowman     4
##  2 tt0088763 Back to the Future             4
##  3 tt0088846 Brazil                         4
##  4 tt0088847 The Breakfast Club             4
##  5 tt0089155 Fletch                         4
##  6 tt0089880 Rambo: First Blood Part II     4
##  7 tt0090329 Witness                        4
##  8 tt0090357 Young Sherlock Holmes          4
##  9 tt0091042 Ferris Bueller's Day Off       4
## 10 tt0091129 The Golden Child               4
## # ℹ 1,417 more rows
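
The check chunk itself is not reproduced here. A check consistent with the output above might look like the following sketch, which counts rows per movie in a made-up summary (toy_summary, invented values) using count(); treat it as an assumption about what the check does.

```r
library(dplyr)

# Hypothetical summary: 2 movies, each with the same 4 categories
toy_summary <- tibble(
  imdb_id  = rep(c("tt0000001", "tt0000002"), each = 4),
  name     = rep(c("Movie A", "Movie B"), each = 4),
  category = rep(c("violence", "language", "sex/nudity", "drugs"), times = 2)
)

# One row per movie, with n = number of rows that movie has in the summary
rows_per_movie <- toy_summary %>%
  count(imdb_id, name)

rows_per_movie
```

If every value of n is 4 and the number of rows equals the number of movies, each movie appears once per category.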

Question 4

If you were unable to complete question 3, you can use the “movies q4.csv” data set for this question.

Using the movies_summary data set and the relevant dplyr verbs, create a graph that has the categories on the y-axis and the average number of occurrences on the x-axis, represented by a bar. See the graph in Brightspace!
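
A rough sketch of this kind of graph is shown below, assuming both dplyr and ggplot2 are installed and using a made-up summary (toy_summary, invented values). The styling of the graph in Brightspace is not reproduced, and geom_col() is just one way to draw the bars.

```r
library(dplyr)
library(ggplot2)

# Hypothetical summary with two movies and two categories
toy_summary <- tibble(
  category    = c("violence", "violence", "language", "language"),
  occurrences = c(2L, 4L, 10L, 20L)
)

# Average occurrences per category
toy_avg <- toy_summary %>%
  group_by(category) %>%
  summarise(avg_occurrences = mean(occurrences))

# Horizontal bars: average on the x-axis, category on the y-axis
p <- ggplot(toy_avg, aes(x = avg_occurrences, y = category)) +
  geom_col() +
  labs(x = "Average occurrences", y = "Category")

p
```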