The movies data set has 44010 rows about the amount of explicit
content (drugs, language, sex, nudity, and violence) found in 1467
movies released since 1958. Each movie is represented by 30 rows (1 row
= one movie and tag_name type combination).
The relevant variables in the data set are:
## # A tibble: 44,010 × 12
## imdb_id name title_main title_subscript year rating run_time studio
## <chr> <chr> <chr> <chr> <int> <chr> <int> <chr>
## 1 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 2 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 3 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 4 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 5 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 6 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 7 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 8 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 9 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## 10 tt0052357 Vertigo Vertigo "" 1958 PG 7680 Universal
## # ℹ 44,000 more rows
## # ℹ 4 more variables: category <chr>, tag_name <chr>, occurrences <int>,
## # occur_duration <int>
Create a data set named movies2 that has the following rows:
Additionally, movies2 should only have the imdb_id, name, year, rating, run_time, studio, category, tag_name, occurrences, and occur_duration columns.
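As a minimal sketch of the column-selection step only, the example below uses a small made-up stand-in for the movies data (movies_toy and all of its values are hypothetical). It assumes dplyr is installed and uses select() to keep the ten requested columns. The row filter is deliberately left out, because its conditions are not listed above.

```r
library(dplyr)

# Hypothetical toy stand-in for the movies data; all values are made up.
movies_toy <- tibble(
  imdb_id         = c("tt0000001", "tt0000002"),
  name            = c("Movie A", "Movie B"),
  title_main      = c("Movie A", "Movie B"),
  title_subscript = c("", ""),
  year            = c(1999L, 2005L),
  rating          = c("R", "PG"),
  run_time        = c(6000L, 5400L),
  studio          = c("Studio X", "Studio Y"),
  category        = c("language", "violence"),
  tag_name        = c("profanity", "graphic"),
  occurrences     = c(12L, 3L),
  occur_duration  = c(30L, 45L)
)

# Keep only the ten requested columns (drops title_main and title_subscript).
# The row filter from the question is not shown here.
movies2_toy <- movies_toy %>%
  select(imdb_id, name, year, rating, run_time, studio,
         category, tag_name, occurrences, occur_duration)

names(movies2_toy)
```

With the real data, the same select() call would follow whatever filter() step the question requires.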
The code chunk below will display a random selection of rows to be used to check if question 1 was done correctly:
## # A tibble: 17,124 × 10
## imdb_id name year rating run_time studio category tag_name occurrences
## <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <int>
## 1 tt1124037 Free S… 2016 R 8400 Unive… sexual sex_wit… 0
## 2 tt5437928 Colette 2018 R 6720 Kille… sexual sex_wit… 6
## 3 tt12003946 Violen… 2022 R 6720 Unive… violence graphic 144
## 4 tt3501112 Securi… 2017 R 5220 Mille… language sexual_… 2
## 5 tt0765429 Americ… 2007 R 9420 Unive… sexual sex_wit… 0
## 6 tt4943322 The Ot… 2017 R 6600 Easte… language profani… 3
## 7 tt14444726 TAR 2022 R 9480 Stand… violence disturb… 0
## 8 tt6684714 Acts o… 2018 R 5160 Colec… sexual sex_wit… 0
## 9 tt1462758 Buried 2010 R 5640 Lions… language profani… 98
## 10 tt0837156 Pee-We… 2016 PG 5400 SYOSS… sexual sex_wit… 0
## # ℹ 17,114 more rows
## # ℹ 1 more variable: occur_duration <int>
If you’re unable to complete question 1, you can use the “movies q2.csv” data set in Brightspace.
Change the run_time and category columns in the movies2 data set as follows:
run_time: Change the values from seconds to minutes. Round to the closest minute.
category: Reduce the number of groups from 5 to 4 by combining sexual and immodesty into 1 group, sex/nudity.
Make sure to use the appropriate dplyr verb(s)!
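One possible way to make both changes is a single mutate() call. The sketch below runs on a made-up three-row table (movies2_toy and its values are hypothetical) and assumes dplyr is installed. It uses round() for the seconds-to-minutes conversion and if_else() to merge the two categories. case_when() or a recoding function would work as well.

```r
library(dplyr)

# Hypothetical toy rows; run_time is in seconds, as in the original data.
movies2_toy <- tibble(
  name     = c("Movie A", "Movie B", "Movie C"),
  run_time = c(7680L, 5220L, 6610L),
  category = c("sexual", "immodesty", "language")
)

movies2_toy <- movies2_toy %>%
  mutate(
    # seconds -> minutes, rounded to the nearest whole minute
    run_time = round(run_time / 60),
    # merge "sexual" and "immodesty" into one group; leave the rest unchanged
    category = if_else(category %in% c("sexual", "immodesty"),
                       "sex/nudity", category)
  )

movies2_toy
```

Because the result is assigned back to the same name, the modified columns replace the originals rather than being added as new ones.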
Show the 10 rows with the most occurrences. Display only the name of the movie, run_time, category, tag_name, and occurrences (the movies2 data set itself should still have all 10 columns).
## name run_time category tag_name occurrences
## 1 Uncut Gems 136 language profanity 883
## 2 The Wolf of Wall Street 180 language profanity 743
## 3 This Is the End 107 language profanity 586
## 4 Casino 179 language profanity 541
## 5 End of Watch 109 language profanity 502
## 6 8 Mile 111 language profanity 479
## 7 They Cloned Tyrone 122 language profanity 478
## 8 Cherry 140 language profanity 461
## 9 Reservoir Dogs 99 language profanity 457
## 10 Malcolm & Marie 106 language profanity 455
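A listing like the one above could be produced by sorting and then selecting without reassigning movies2, so the stored data set keeps all of its columns. The sketch below is one way to do that on a made-up table (movies2_toy and its values are hypothetical), assuming dplyr is installed. It keeps the top 2 rows instead of 10 only because the toy table is small.

```r
library(dplyr)

# Hypothetical toy rows; values are made up for illustration.
movies2_toy <- tibble(
  name        = c("Movie A", "Movie B", "Movie C"),
  year        = c(1999L, 2005L, 2012L),
  run_time    = c(128, 87, 110),
  category    = c("language", "violence", "language"),
  tag_name    = c("profanity", "graphic", "profanity"),
  occurrences = c(40L, 7L, 95L)
)

# Sort from most to fewest occurrences, keep the first rows, then show
# only the requested columns. The result is not assigned back to
# movies2_toy, so the original table is left unchanged.
top_rows <- movies2_toy %>%
  arrange(desc(occurrences)) %>%
  head(2) %>%
  select(name, run_time, category, tag_name, occurrences)

top_rows
```

For the real data, head(10) would replace head(2).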
What tag do all 10 movies with the most occurrences have?
The 10 rows with the most occurrences all have the profanity tag.
If you were unable to complete question 3, you can use the “movies q4.csv” data set for this question.
Using the movies_summary data set and the relevant dplyr verbs, create a graph that has the categories on the y-axis and the average number of occurrences on the x-axis, represented by a bar. See the graph in Brightspace!
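The summarise-then-plot pattern could look roughly like the sketch below. It is built on assumptions: the layout of movies_summary is not shown here, so the example uses a made-up table (summary_toy) with category and occurrences columns, and it assumes ggplot2 is the plotting package. The reference graph in Brightspace may differ in labels and styling.

```r
library(dplyr)
library(ggplot2)  # assumption: ggplot2 is the plotting package in use

# Hypothetical toy data; the real movies_summary may have other columns.
summary_toy <- tibble(
  category    = c("language", "language", "violence"),
  occurrences = c(10, 20, 6)
)

# One row per category with its average number of occurrences.
avg_by_category <- summary_toy %>%
  group_by(category) %>%
  summarize(avg_occurrences = mean(occurrences))

# Horizontal bars: averages on the x-axis, categories on the y-axis.
avg_plot <- ggplot(avg_by_category,
                   aes(x = avg_occurrences, y = category)) +
  geom_col() +
  labs(x = "Average occurrences", y = "Category")

avg_plot
```

geom_col() is used rather than geom_bar() because the bar lengths come from an already computed column, not from row counts.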