DS 2870 - Homework 4

Data Description

The movies data set has 44010 rows describing the amount of explicit content (drugs, language, sex, nudity, and violence) found in 1467 movies released since 1958. Each movie is represented by 30 rows (one row per combination of movie and tag_name).

The relevant variables in the data set are:

imdb_id: The identifier used by IMDB to uniquely specify the movie
name, title_main, title_subscript: The name, main title, and subtitle of the movie. name = “title_main: title_subscript”
year: The year the movie was released
rating: The MPAA rating of the movie (PG/PG-13/R)
run_time: The duration of the movie (in seconds)
studio: The studio that released the movie
category: The type of explicit content - language/violence/immodesty/sexual/drugs/other
tag_name: A subcategory of the type of explicit content. There are 30 different values of tag_name
- See “tags.csv” for what each tag_name represents about the type of explicit content in the movie
occurrences: The number of times/scenes of the tag_name type of content in the movie
occur_duration: The length of time of the scenes for the tag_name (in seconds)

## # A tibble: 44,010 × 12
##    imdb_id   name    title_main title_subscript  year rating run_time studio   
##    <chr>     <chr>   <chr>      <chr>           <int> <chr>     <int> <chr>    
##  1 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  2 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  3 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  4 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  5 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  6 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  7 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  8 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
##  9 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
## 10 tt0052357 Vertigo Vertigo    ""               1958 PG         7680 Universal
## # ℹ 44,000 more rows
## # ℹ 4 more variables: category <chr>, tag_name <chr>, occurrences <int>,
## #   occur_duration <int>

Question 1: Removing unwanted rows and columns

Create a data set named movies2 that has the following rows:

Only movies released in 1985 and after
Not in the “other” category
Does not have any of the following tags:
- “childish_language”, “blasphemy”, “nudity_implied”, “immodesty”, “drugs_implied”, “drugs_legal”, “non_graphic”, “violence_implied”, “kissing_normal”, “kissing_passion”, “sex_implied”, “sexually_suggestive”

Additionally, movies2 should only have the imdb_id, name, year, rating, run_time, studio, category, tag_name, occurrences, and occur_duration columns.
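
As a minimal sketch of the row and column filtering described above, the example below applies filter() and select() to a small invented tibble. The toy_movies data, its values, the "misc" tag, and the shortened unwanted_tags list are assumptions for illustration only; they are not the homework data or the intended solution.

```r
library(dplyr)

# Toy stand-in for the movies data (all values invented for illustration)
toy_movies <- tibble(
  name       = c("Old Film", "New Film", "New Film", "New Film"),
  title_main = c("Old Film", "New Film", "New Film", "New Film"),
  year       = c(1970L, 1999L, 1999L, 1999L),
  category   = c("language", "language", "other", "violence"),
  tag_name   = c("profanity", "blasphemy", "misc", "graphic")
)

# Shortened, illustrative list of tags to drop
unwanted_tags <- c("blasphemy", "non_graphic")

toy_movies2 <- toy_movies %>%
  # filter() keeps only rows where every condition is TRUE
  filter(
    year >= 1985,
    category != "other",
    !(tag_name %in% unwanted_tags)
  ) %>%
  # select() keeps only the named columns, in the order given
  select(name, year, category, tag_name)

toy_movies2
```

Writing the unwanted tags as a vector and negating %in% avoids chaining one != comparison per tag.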

The code chunk below will display a random selection of rows to be used to check if question 1 was done correctly:

## # A tibble: 17,124 × 10
##    imdb_id    name     year rating run_time studio category tag_name occurrences
##    <chr>      <chr>   <int> <chr>     <int> <chr>  <chr>    <chr>          <int>
##  1 tt1124037  Free S…  2016 R          8400 Unive… sexual   sex_wit…           0
##  2 tt5437928  Colette  2018 R          6720 Kille… sexual   sex_wit…           6
##  3 tt12003946 Violen…  2022 R          6720 Unive… violence graphic          144
##  4 tt3501112  Securi…  2017 R          5220 Mille… language sexual_…           2
##  5 tt0765429  Americ…  2007 R          9420 Unive… sexual   sex_wit…           0
##  6 tt4943322  The Ot…  2017 R          6600 Easte… language profani…           3
##  7 tt14444726 TAR      2022 R          9480 Stand… violence disturb…           0
##  8 tt6684714  Acts o…  2018 R          5160 Colec… sexual   sex_wit…           0
##  9 tt1462758  Buried   2010 R          5640 Lions… language profani…          98
## 10 tt0837156  Pee-We…  2016 PG         5400 SYOSS… sexual   sex_wit…           0
## # ℹ 17,114 more rows
## # ℹ 1 more variable: occur_duration <int>

Question 2: Changing some of the columns

If you’re unable to complete question 1, you can use the “movies q2.csv” data set in Brightspace.

Change the run_time and category columns in the movies2 data set as follows:

run_time: Change the values from seconds to minutes, rounded to the nearest minute
category: Reduce the number of groups from 5 to 4 by combining sexual and immodesty into one group, sex/nudity

Make sure to use the appropriate dplyr verb(s)!
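
The two column changes above can be sketched with mutate() on a small invented tibble. The data values and film names here are assumptions for illustration, and if_else() is only one of several ways to recode the category column.

```r
library(dplyr)

# Toy stand-in for movies2 (all values invented for illustration)
toy_movies2 <- tibble(
  name     = c("Film A", "Film B", "Film C"),
  run_time = c(7680L, 5220L, 6100L),   # seconds
  category = c("sexual", "immodesty", "language")
)

toy_movies2 <- toy_movies2 %>%
  mutate(
    # Convert seconds to minutes and round to the nearest whole minute
    run_time = round(run_time / 60),
    # Merge the two categories into one label; leave the others unchanged
    category = if_else(
      category %in% c("sexual", "immodesty"),
      "sex/nudity",
      category
    )
  )

toy_movies2
```

Because mutate() overwrites columns that already exist, the result still has the same number of columns as before.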

Show the 10 rows with the most occurrences. Just display the name of the movie, run_time, category, tag_name, and occurrences (the movies2 data set should still have all 10 columns)

##                       name run_time category  tag_name occurrences
## 1               Uncut Gems      136 language profanity         883
## 2  The Wolf of Wall Street      180 language profanity         743
## 3          This Is the End      107 language profanity         586
## 4                   Casino      179 language profanity         541
## 5             End of Watch      109 language profanity         502
## 6                   8 Mile      111 language profanity         479
## 7       They Cloned Tyrone      122 language profanity         478
## 8                   Cherry      140 language profanity         461
## 9           Reservoir Dogs       99 language profanity         457
## 10         Malcolm & Marie      106 language profanity         455
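
A display like the one above can be sketched by sorting and then trimming columns without saving the result back to the data set. The toy data below is invented for illustration, and arrange() followed by head() is an assumed approach, not necessarily how the output above was produced.

```r
library(dplyr)

# Toy stand-in for movies2 (all values invented for illustration)
toy_movies2 <- tibble(
  name        = c("Film A", "Film B", "Film C", "Film D"),
  studio      = c("Studio X", "Studio Y", "Studio X", "Studio Z"),
  tag_name    = c("profanity", "graphic", "profanity", "gore"),
  occurrences = c(12L, 40L, 95L, 3L)
)

top_rows <- toy_movies2 %>%
  arrange(desc(occurrences)) %>%          # largest counts first
  select(name, tag_name, occurrences) %>% # columns to display only
  head(2)                                 # keep the top 2 rows of the toy data

top_rows
```

Since the pipeline is stored under a new name, toy_movies2 itself keeps all of its columns.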

What tag do all 10 movies with the most occurrences have?

All 10 rows with the most occurrences have the profanity tag

Question 3: Combining tags in the same category

If you were unable to complete question 2, you can use the “movies q3.csv” data set for this question

For this question, we want the data to include only the category and ignore the tags. That is, if a movie had 3 “graphic” scenes and 2 scenes with “gore”, we want to combine those scenes since both tags fall under the “violence” category, so the movie will have 5 violent scenes.

Using the movies2 data set, create a data frame called movies_summary that has the following columns:

imdb_id: The IMDB identifier of the movie
name: the name of the movie
rating: The MPAA rating of the movie (PG/PG-13/R)
run_time: The length of the movie (in minutes)
category: The type of explicit content (language, violence, sex/nudity, drugs)
occurrences: The total number of scenes in the movie of the category type
occur_duration: The total length (in seconds) of all scenes in the movie of the category type
occurred: Either the value “yes” if there is at least 1 scene of that content type in the movie and “no” if there are 0 explicit scenes of that type in the movie

Hint: You should be creating the columns in the order specified using 2 different dplyr verbs
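
The idea of collapsing tags into one row per movie and category can be sketched on a small invented tibble. The IDs and counts below are assumptions for illustration, and pairing group_by() with summarize() and then mutate() is one possible reading of the hint, not a confirmed solution.

```r
library(dplyr)

# Toy stand-in for movies2 (all values invented for illustration)
toy_movies2 <- tibble(
  imdb_id        = c("tt001", "tt001", "tt001", "tt002"),
  category       = c("violence", "violence", "language", "violence"),
  tag_name       = c("graphic", "gore", "profanity", "graphic"),
  occurrences    = c(3L, 2L, 0L, 1L),
  occur_duration = c(30L, 12L, 0L, 5L)
)

toy_summary <- toy_movies2 %>%
  # One group per movie and category; tags inside a group are combined
  group_by(imdb_id, category) %>%
  summarize(
    occurrences    = sum(occurrences),
    occur_duration = sum(occur_duration),
    .groups = "drop"
  ) %>%
  # Flag whether the category appears at all in the movie
  mutate(occurred = if_else(occurrences > 0, "yes", "no"))

toy_summary
```

In the toy data, movie tt001 has 3 "graphic" and 2 "gore" scenes, so its violence row ends up with 5 occurrences.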

Use tibble() to display the first 10 rows

## # A tibble: 5,708 × 8
##    imdb_id   name   rating run_time category occurrences occur_duration occurred
##    <chr>     <chr>  <chr>     <dbl> <chr>          <int>          <int> <chr>   
##  1 tt8368408 Gunpo… R           114 violence          53            330 yes     
##  2 tt7395114 The D… R           138 drugs             12             48 yes     
##  3 tt1745960 Top G… PG-13       131 language          69             78 yes     
##  4 tt0384642 Kicki… PG           95 language          11             12 yes     
##  5 tt7401588 Insta… PG-13       118 sex/nud…           5             12 yes     
##  6 tt1426328 Leonie PG-13       132 sex/nud…           9            126 yes     
##  7 tt6472976 Five … PG-13       116 violence           0              0 no      
##  8 tt5442430 Life   R           104 drugs              0              0 no      
##  9 tt2283336 Men i… PG-13       115 language          40             42 yes     
## 10 tt1838556 Hones… PG-13        99 sex/nud…           0              0 no      
## # ℹ 5,698 more rows

Use the code chunk below to check that you did it successfully. Each movie should occur exactly 4 times (once for each category), and the result should have 1427 rows (one per movie).

## # A tibble: 1,427 × 3
##    imdb_id   name                           n
##    <chr>     <chr>                      <int>
##  1 tt0087231 The Falcon and the Snowman     4
##  2 tt0088763 Back to the Future             4
##  3 tt0088846 Brazil                         4
##  4 tt0088847 The Breakfast Club             4
##  5 tt0089155 Fletch                         4
##  6 tt0089880 Rambo: First Blood Part II     4
##  7 tt0090329 Witness                        4
##  8 tt0090357 Young Sherlock Holmes          4
##  9 tt0091042 Ferris Bueller's Day Off       4
## 10 tt0091129 The Golden Child               4
## # ℹ 1,417 more rows
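
A check of this kind can be sketched with count(), which returns one row per group along with the group size n. The toy summary below is invented for illustration; whether the original check used count() is an assumption.

```r
library(dplyr)

# Toy stand-in for movies_summary: 2 movies x 4 categories (values invented)
toy_summary <- tibble(
  imdb_id  = rep(c("tt001", "tt002"), each = 4),
  name     = rep(c("Film A", "Film B"), each = 4),
  category = rep(c("drugs", "language", "sex/nudity", "violence"), times = 2)
)

# One row per movie, with n = number of rows that movie has
check <- toy_summary %>%
  count(imdb_id, name)

check
```

If every n equals 4 and the number of rows equals the number of movies, each movie appears once per category.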

Question 4

If you were unable to complete question 3, you can use the “movies q4.csv” data set for this question

Using the movies_summary data set and the relevant dplyr verbs, create a graph that has the categories on the y-axis and the average number of occurrences on the x-axis, represented by a bar. See the graph in Brightspace!
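
A horizontal bar chart of per-category averages can be sketched as follows. The toy data, the avg_occurrences column name, and the axis labels are assumptions for illustration; the styling of the graph in Brightspace is not known here, so this shows only the general shape of the approach.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for movies_summary (all values invented for illustration)
toy_summary <- tibble(
  imdb_id     = rep(c("tt001", "tt002"), each = 2),
  category    = rep(c("language", "violence"), times = 2),
  occurrences = c(10L, 4L, 30L, 0L)
)

# Average occurrences per category (column name is hypothetical)
avg_by_category <- toy_summary %>%
  group_by(category) %>%
  summarize(avg_occurrences = mean(occurrences), .groups = "drop")

# geom_col() draws bars whose length is the value already in the data;
# mapping the category to y makes the bars horizontal
p <- ggplot(avg_by_category, aes(x = avg_occurrences, y = category)) +
  geom_col() +
  labs(x = "Average occurrences", y = "Category")

p
```

geom_col() is used instead of geom_bar() because the averages are computed beforehand rather than counted by the plot.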

DS 2870 - Homework 4 - dplyr

Your Name

2025-05-22
