Assignment 3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

# Load the movies dataset
movies <- read_csv("https://gist.githubusercontent.com/tiangechen/b68782efa49a16edaf07dc2cdaa855ea/raw/0c794a9717f18b094eabab2cd6a6b9a226903577/movies.csv")

## Rows: 77 Columns: 8

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Film, Genre, Lead Studio, Worldwide Gross
## dbl (4): Audience score %, Profitability, Rotten Tomatoes %, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 1

Rename the “Film” column to “movie_title” and “Year” to “release_year”.

q1 <- movies %>%
  rename(movie_title = Film,
         release_year = Year)

head(q1)

## # A tibble: 6 × 8
##   movie_title               Genre `Lead Studio` `Audience score %` Profitability
##   <chr>                     <chr> <chr>                      <dbl>         <dbl>
## 1 Zack and Miri Make a Por… Roma… The Weinstei…                 70          1.75
## 2 Youth in Revolt           Come… The Weinstei…                 52          1.09
## 3 You Will Meet a Tall Dar… Come… Independent                   35          1.21
## 4 When in Rome              Come… Disney                        44          0   
## 5 What Happens in Vegas     Come… Fox                           72          6.27
## 6 Water For Elephants       Drama 20th Century…                 72          3.08
## # ℹ 3 more variables: `Rotten Tomatoes %` <dbl>, `Worldwide Gross` <chr>,
## #   release_year <dbl>

Question 2

Create a new dataframe with only the columns: movie_title, release_year, Genre, Profitability.

q2 <- q1 %>%
  select(movie_title, release_year, Genre, Profitability)
print(head(q2))

## # A tibble: 6 × 4
##   movie_title                        release_year Genre   Profitability
##   <chr>                                     <dbl> <chr>           <dbl>
## 1 Zack and Miri Make a Porno                 2008 Romance          1.75
## 2 Youth in Revolt                            2010 Comedy           1.09
## 3 You Will Meet a Tall Dark Stranger         2010 Comedy           1.21
## 4 When in Rome                               2010 Comedy           0   
## 5 What Happens in Vegas                      2008 Comedy           6.27
## 6 Water For Elephants                        2011 Drama            3.08

Question 3

Filter the dataset to include only movies released after 2000 with a Rotten Tomatoes % higher than 80.

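# The prompt asks for movies released after 2000; the cutoff used here is 2008,
# so the output below reflects that stricter filter.
q3 <- q1 %>%
  filter(release_year > 2008 & `Rotten Tomatoes %` > 80)
print(q3)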

## # A tibble: 7 × 8
##   movie_title          Genre     `Lead Studio`  `Audience score %` Profitability
##   <chr>                <chr>     <chr>                       <dbl>         <dbl>
## 1 Tangled              Animation Disney                         88         1.37 
## 2 My Week with Marilyn Drama     The Weinstein…                 84         0.826
## 3 Midnight in Paris    Romence   Sony                           84         8.74 
## 4 Jane Eyre            Romance   Universal                      77         0    
## 5 Beginners            Comedy    Independent                    80         4.47 
## 6 A Serious Man        Drama     Universal                      64         4.38 
## 7 (500) Days of Summer comedy    Fox                            81         8.10 
## # ℹ 3 more variables: `Rotten Tomatoes %` <dbl>, `Worldwide Gross` <chr>,
## #   release_year <dbl>
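How the year cutoff changes the result can be sketched on a small made-up table. The tibble `toy`, its titles, and the `year_cutoff` variable are hypothetical and only illustrate the comparison; they are not rows from movies.csv.

```r
library(dplyr)

# Hypothetical toy data (not taken from movies.csv)
toy <- tibble(
  movie_title = c("A", "B", "C", "D"),
  release_year = c(2007, 2008, 2010, 2011),
  `Rotten Tomatoes %` = c(91, 85, 89, 60)
)

# Cutoff named in the question prompt (assumed variable name)
year_cutoff <- 2000

# Conditions separated by commas in filter() are combined with AND,
# the same as joining them with `&`
after_cutoff <- toy %>%
  filter(release_year > year_cutoff, `Rotten Tomatoes %` > 80)

# Same rating threshold with the stricter 2008 cutoff
after_2008 <- toy %>%
  filter(release_year > 2008, `Rotten Tomatoes %` > 80)

print(after_cutoff)
print(after_2008)
```

On this toy table the 2000 cutoff keeps films A, B, and C, while the 2008 cutoff keeps only C, so the choice of year matters whenever the data contains films from 2008 or earlier.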

Question 4

Add a new column called “Profitability_millions” that converts the Profitability to millions of dollars.

q4 <- q1 %>%
  mutate(Profitability_millions = Profitability * 1000000)
print(select(q4, Profitability, Profitability_millions))

## # A tibble: 77 × 2
##    Profitability Profitability_millions
##            <dbl>                  <dbl>
##  1         1.75                1747542.
##  2         1.09                1090000 
##  3         1.21                1211818.
##  4         0                         0 
##  5         6.27                6267647.
##  6         3.08                3081421.
##  7         2.90                2896019.
##  8        11.1                11089742.
##  9         0.005                  5000 
## 10         4.18                4184038.
## # ℹ 67 more rows

Question 5

Sort the filtered dataset by Rotten Tomatoes % in descending order, and then by Profitability in descending order.

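# The prompt says to sort the filtered dataset; this step sorts q4, which still
# holds all 77 rows. Profitability_millions is Profitability scaled by a constant,
# so sorting on it gives the same order as sorting on Profitability.
q5 <- q4 %>%
  arrange(desc(`Rotten Tomatoes %`), desc(Profitability_millions))
print(select(q5, `Rotten Tomatoes %`, Profitability_millions))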

## # A tibble: 77 × 2
##    `Rotten Tomatoes %` Profitability_millions
##                  <dbl>                  <dbl>
##  1                  96               2896019.
##  2                  93               8744706.
##  3                  93               4005737.
##  4                  91               6636402.
##  5                  89              11089742.
##  6                  89               4382857.
##  7                  89               1365692.
##  8                  87               8096000 
##  9                  85               1384167.
## 10                  85                     0 
## # ℹ 67 more rows
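The two-key sort can be checked on a small made-up table with tied ratings. The tibble `toy` and its values are hypothetical; the point is only to show that the second key in arrange() breaks ties left by the first.

```r
library(dplyr)

# Hypothetical toy data with two pairs of tied ratings
toy <- tibble(
  movie_title = c("A", "B", "C", "D"),
  `Rotten Tomatoes %` = c(89, 93, 89, 93),
  Profitability = c(1.4, 4.0, 4.4, 8.7)
)

# Sort by rating first (highest first); within equal ratings,
# sort by profitability (highest first)
sorted <- toy %>%
  arrange(desc(`Rotten Tomatoes %`), desc(Profitability))

print(sorted)
```

The rows come out in the order D, B, C, A: both 93-rated films precede both 89-rated films, and within each rating the more profitable film is listed first.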

Question 6

Use the pipe operator (%>%) to chain these operations together, starting with the original dataset and ending with a final dataframe that incorporates all the above transformations.

q6 <- movies %>%
  rename(movie_title = Film,
         release_year = Year) %>%
  # Keep `Rotten Tomatoes %` as well, since the filter and sort below need it
  select(movie_title, release_year, Genre, Profitability, `Rotten Tomatoes %`) %>%
  # Same 2008 cutoff as in Question 3
  filter(release_year > 2008 & `Rotten Tomatoes %` > 80) %>%
  mutate(Profitability_millions = Profitability * 1000000) %>%
  arrange(desc(`Rotten Tomatoes %`), desc(Profitability_millions))
head(q6)

## # A tibble: 6 × 6
##   movie_title          release_year Genre     Profitability `Rotten Tomatoes %`
##   <chr>                       <dbl> <chr>             <dbl>               <dbl>
## 1 Midnight in Paris            2011 Romence            8.74                  93
## 2 A Serious Man                2009 Drama              4.38                  89
## 3 Tangled                      2010 Animation          1.37                  89
## 4 (500) Days of Summer         2009 comedy             8.10                  87
## 5 Jane Eyre                    2011 Romance            0                     85
## 6 Beginners                    2011 Comedy             4.47                  84
## # ℹ 1 more variable: Profitability_millions <dbl>

Question 7

From the resulting data, are the best movies the most popular?

From the resulting data, the best movies, meaning those with the highest Rotten Tomatoes scores, are not the most popular. The most profitable movies tend to have lower Rotten Tomatoes scores, while the highest-scoring movies tend to be less profitable.
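One way to put a number on this comparison is a correlation between rating and profitability. The sketch below uses only the six rows printed by head(q6) above (the seventh row of q6 is not shown there), treats profitability as a stand-in for popularity as the answer does, and uses a Pearson correlation as an assumed choice of measure. Six rows are too few to settle the question, so the result is only a rough check.

```r
# Six rows copied from the head(q6) output above
shown <- data.frame(
  rotten_tomatoes = c(93, 89, 89, 87, 85, 84),
  profitability   = c(8.74, 4.38, 1.37, 8.10, 0, 4.47)
)

# Pearson correlation between rating and profitability:
# near +1 means higher-rated films are also more profitable,
# near -1 means the opposite, near 0 means no clear linear link
rating_profit_cor <- cor(shown$rotten_tomatoes, shown$profitability)
print(rating_profit_cor)
```

Running the same calculation on the full q6 (or on the unfiltered data) would give a firmer basis for the answer than the printed rows alone.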

Extra Credit

Create a summary dataframe that shows the average rating and Profitability_millions for movies by Genre. Hint: You’ll need to use group_by() and summarize().

EC <- q6 %>%
  group_by(Genre) %>%
  summarize(
    Avg_Rating = mean(`Rotten Tomatoes %`, na.rm = TRUE),
    Avg_Profitability = mean(Profitability_millions, na.rm = TRUE))
head(EC)

## # A tibble: 6 × 3
##   Genre     Avg_Rating Avg_Profitability
##   <chr>          <dbl>             <dbl>
## 1 Animation         89          1365692.
## 2 Comedy            84          4471875 
## 3 Drama             86          2604329.
## 4 Romance           85                0 
## 5 Romence           93          8744706.
## 6 comedy            87          8096000
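The summary above has separate rows for "Romance" and "Romence" and for "Comedy" and "comedy", because group_by() treats each distinct spelling as its own group. A minimal sketch of cleaning the labels before grouping follows. The tibble `toy` is hypothetical, with ratings borrowed from the summary above for illustration, and treating "Romence" as a misspelling of "Romance" is an assumption about the data.

```r
library(dplyr)

# Hypothetical toy data reusing the Genre spellings seen in the summary above
toy <- tibble(
  Genre = c("Comedy", "comedy", "Romance", "Romence"),
  rating = c(84, 87, 85, 93)
)

cleaned <- toy %>%
  mutate(
    # Capitalize the first letter and lowercase the rest
    Genre = paste0(toupper(substr(Genre, 1, 1)),
                   tolower(substr(Genre, 2, nchar(Genre)))),
    # Assumed correction: "Romence" is taken to mean "Romance"
    Genre = if_else(Genre == "Romence", "Romance", Genre)
  )

summary_by_genre <- cleaned %>%
  group_by(Genre) %>%
  summarize(Avg_Rating = mean(rating), .groups = "drop")

print(summary_by_genre)
```

After cleaning, the toy summary has two groups, Comedy and Romance, instead of four. Applying the same mutate() step to q6 before group_by() would merge the split genres in the same way.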

Paul Schiavone

2025-02-10
