The file is actually semicolon separated, but it is simple to read in using read_delim. An initial look shows that overall the data is well structured, but would benefit from collapsing the genre data.
raw_data <- read_delim('https://raw.githubusercontent.com/Tillmawitz/data_607/refs/heads/main/project_2/movies.csv', delim = ";")
## Rows: 3883 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (1): Name
## dbl (20): ReleaseDate, Action, Adventure, Children, Comedy, Crime, Documenta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_data
## # A tibble: 3,883 × 21
## Name ReleaseDate Action Adventure Children Comedy Crime Documentary Drama
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Toy Sto… 1995 0 0 1 1 0 0 0
## 2 Jumanji… 1995 0 1 1 0 0 0 0
## 3 Grumpie… 1995 0 0 0 1 0 0 0
## 4 Waiting… 1995 0 0 0 1 0 0 1
## 5 Father … 1995 0 0 0 1 0 0 0
## 6 Heat (1… 1995 1 0 0 0 1 0 0
## 7 Sabrina… 1995 0 0 0 1 0 0 0
## 8 Tom and… 1995 0 1 1 0 0 0 0
## 9 Sudden … 1995 1 0 0 0 0 0 0
## 10 GoldenE… 1995 1 1 0 0 0 0 0
## # ℹ 3,873 more rows
## # ℹ 12 more variables: Fantasy <dbl>, Noir <dbl>, Horror <dbl>, Musical <dbl>,
## # Mystery <dbl>, Romance <dbl>, SciFi <dbl>, Thriller <dbl>, War <dbl>,
## # Western <dbl>, AvgRating <dbl>, Watches <dbl>
Tidying the data is fairly simple and can be achieved by pivoting the genre column names into a single column called “genre” and mapping the values to a separate column. By filtering the rows containing the genres the movie is not in we have a nice data set tracking all the relavant information.
tidied <- raw_data |>
pivot_longer(cols = Action:Western, names_to = "genre", values_to = "in_genre") |>
filter(in_genre == 1) |>
select(!in_genre)
tidied
## # A tibble: 6,303 × 5
## Name ReleaseDate AvgRating Watches genre
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Toy Story (1995) 1995 4.15 2077 Children
## 2 Toy Story (1995) 1995 4.15 2077 Comedy
## 3 Jumanji (1995) 1995 3.2 701 Adventure
## 4 Jumanji (1995) 1995 3.2 701 Children
## 5 Jumanji (1995) 1995 3.2 701 Fantasy
## 6 Grumpier Old Men (1995) 1995 3.02 478 Comedy
## 7 Grumpier Old Men (1995) 1995 3.02 478 Romance
## 8 Waiting to Exhale (1995) 1995 2.73 170 Comedy
## 9 Waiting to Exhale (1995) 1995 2.73 170 Drama
## 10 Father of the Bride Part II (1995) 1995 3.01 296 Comedy
## # ℹ 6,293 more rows
It should be noted that even this format has a fair amount of repeated information. The trade off is worth it to make aggregations and analysis easier, but outside of the analytic environment the data would be better stored in two different tables. By separating the repeated columns (all except genre) the information can be more easily updated without the risk of diverging edits. The table used for analysis can be easily reconstructed using a join on the Name column.
movie_information <- tidied |>
select(!genre) |>
distinct()
genre_information <- tidied |>
select(Name, genre)
The original post (https://brightspace.cuny.edu/d2l/le/437179/discussions/threads/452064/View) has some suggested analysis that we can perform, and some additional analysis is provided for fun. First we can take a look at the popularity by category. This is a rather subjective measure, as popularity could be tracked by metrics such as viewership or ratings. If we chose viewership alone we would likely want to adjust the numbers to account for population growth over time, but that data has not been included. Simply taking an average of movie ratings would give undue weight to movies with very low viewership numbers and unfairly penalize movies with high viewership rates. It is therefore best to use a weighted average of the scores, weighted by the overall viewership. Doing so shows us that there is not a huge spread in scores with only half a point separating the most popular genre, Noir, from the least popular, Horror. This indicates the data could probably benefit from being rescaled, but given the numerous schools of thought on this we will leave this for now.
tidied |>
group_by(genre) |>
summarise(weighted_rating = weighted.mean(AvgRating, Watches)) |>
ggplot(aes(weighted_rating, genre)) +
geom_col() +
labs(title = "Weighted Average of Genre Scores", x = "Rating (out of 5)", y = "Genre")
Moving on, we can take a look at how the release rates of the genres vary over time to begin to explore genre popularity. The data has been windowed to years after 1980 to allow modern trends to be more apparent. The number of releases increases dramatically with time, so trends in the early years of the dataset are masked by the scale of later trends. Looking at the plot it is clear the genres with the most releases over this period are Drama and Comedy, a bit of an amusing juxtaposition.
tidied |>
filter(ReleaseDate > 1980) |>
count(genre, ReleaseDate) |>
ggplot(aes(x = ReleaseDate, y = n, color = genre)) +
geom_line() +
labs(title = "Unique Releases Since 1980", x = "ReleaseYear", y = "Number of Releases", color = "Genre")
If we instead use viewership as a metric of popularity we see the same two genres on top, but the gap between them and the other genres is decreasing. This indicates the viewership of the top genres may be inflated by having more overall releases.
tidied |>
filter(ReleaseDate > 1980) |>
group_by(genre, ReleaseDate) |>
summarise(total_views = sum(Watches)) |>
ggplot(aes(x = ReleaseDate, y = total_views, color = genre)) +
geom_line() +
labs(title = "Genre Viewership Since 1980", x = "ReleaseYear", y = "Total Views", color = "Genre")
## `summarise()` has grouped output by 'genre'. You can override using the
## `.groups` argument.
By measuring viewership per release we can get a scaled metric that accounts for viewership inflation due to the number of releases. Doing so shows us that by and large the genres are equally popular. There are some spikes on a given year but overall the genres vary within a very similar range and trade places frequently.
viewership_counts <- tidied |>
filter(ReleaseDate > 1980) |>
group_by(genre, ReleaseDate) |>
count() |>
rename(releases = n)
tidied |>
filter(ReleaseDate > 1980) |>
group_by(genre, ReleaseDate) |>
summarise(total_views = sum(Watches)) |>
full_join(viewership_counts) |>
mutate(views_per_release = total_views/releases) |>
ggplot(aes(x = ReleaseDate, y = views_per_release, color = genre)) +
geom_line() +
labs(x = "Release Year", y = "Views per Release", title = "Viewership Rates By Genre", color = "Genre")
## `summarise()` has grouped output by 'genre'. You can override using the
## `.groups` argument.
## Joining with `by = join_by(genre, ReleaseDate)`
As a fun additional look we can see that on average a movie fits into 1.6 genres and the most genres a movie had was 5.
tidied |>
count(Name) |>
summarise(average = mean(n), max = max(n))
## # A tibble: 1 × 2
## average max
## <dbl> <int>
## 1 1.62 5