Project 2

Author

Samantha Barbaro

Approach

I’ve created a table with movie rating data from six definitely real people in SQL. I will create a tidy data frame that reads nulls correctly into R.

I will use the tidyverse to inspect the data, find out how many missing values there are, and find general descriptive statistics (potentially breaking out different facets of data, for example, just superhero movies). I’ll find out which movie(s) were seen the most/least, as well as the average rating for each movie. I will also define my approach for null values in this data set (for example, while it might be okay to exclude null values when calculating the average rating for each movie, is this approach acceptable for calculating the average rating for superhero movies in general?).

Data set:

https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv

SQL queries that created the data set are available here:

https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/Project_2_queries.sql

Loading the data set

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

sql_url <- "https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv"

#read and convert blanks and NULL to NA

movie_data <- read.csv(sql_url, na = c("", "NA", "null", "NULL"))

glimpse(movie_data)

Rows: 6
Columns: 11
$ viewer            <chr> "anne", "cat", "greta", "james", "jeff", "nico"
$ iron_man          <int> 1, NA, NA, 2, NA, 5
$ captain_america   <int> 2, NA, 4, 5, NA, NA
$ avengers          <int> 3, NA, 3, NA, 4, 2
$ super_man         <int> NA, 5, NA, 2, 1, 2
$ birds_of_prey     <int> 5, 4, NA, NA, NA, 5
$ the_suicide_squad <int> 4, NA, NA, 4, NA, 4
$ the_grinch        <int> 1, 2, NA, 3, NA, NA
$ the_holiday       <int> 1, 4, NA, 2, NA, NA
$ love_actually     <int> 2, NA, NA, 5, 5, NA
$ rudolph           <int> 1, 4, NA, NA, 5, NA

Converting to a tibble

#converting to tibble

movie_data <- as_tibble(movie_data) 

print(movie_data)

# A tibble: 6 × 11
  viewer iron_man captain_america avengers super_man birds_of_prey
  <chr>     <int>           <int>    <int>     <int>         <int>
1 anne          1               2        3        NA             5
2 cat          NA              NA       NA         5             4
3 greta        NA               4        3        NA            NA
4 james         2               5       NA         2            NA
5 jeff         NA              NA        4         1            NA
6 nico          5              NA        2         2             5
# ℹ 5 more variables: the_suicide_squad <int>, the_grinch <int>,
#   the_holiday <int>, love_actually <int>, rudolph <int>

Pivot longer

#this may be required to analyze the data (making the data into a 2-column format)

long_movie_data <- movie_data |> pivot_longer(
  -viewer,
  names_to = "movie",
  values_to = "rating"
)

#get some descriptive statistics
summary(long_movie_data)

    viewer             movie               rating     
 Length:60          Length:60          Min.   :1.000  
 Class :character   Class :character   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :3.500  
                                       Mean   :3.188  
                                       3rd Qu.:4.250  
                                       Max.   :5.000  
                                       NA's   :28

Approach for NA values

For the purposes of counting which movies were seen the most and determining average ratings, I will remove null values. I don’t want to count null values when counting the number of times a movie was seen – this will give me incorrect results, same for treating null ratings as zeroes when determining averages.

Average rating for each movie

#average rating for each movie (rounding an arranging by avg rating)
avg_movie <- long_movie_data |>  group_by(movie) |> 
  summarize(avg_rating = round(mean(rating, na.rm = TRUE), 1)) |> arrange(desc(avg_rating))

print(avg_movie)

# A tibble: 10 × 2
   movie             avg_rating
   <chr>                  <dbl>
 1 birds_of_prey            4.7
 2 love_actually            4  
 3 the_suicide_squad        4  
 4 captain_america          3.7
 5 rudolph                  3.3
 6 avengers                 3  
 7 iron_man                 2.7
 8 super_man                2.5
 9 the_holiday              2.3
10 the_grinch               2

#a table without NA values

movies_no_na <- long_movie_data |> 
    drop_na()

#graph 
ggplot(data = avg_movie, aes(x = reorder(movie, avg_rating), y = avg_rating, fill = movie)) +
    geom_col() +
    coord_flip() +
    labs(x = "Movie", y = "Average Rating", title = "Average Movie Rating")

Which movies were seen the most?

movies_seen <- long_movie_data |> 
  group_by(movie) |> 
  summarize(total_seen = sum(!is.na(rating))) |> 
  arrange(desc(total_seen))

print(movies_seen)

# A tibble: 10 × 2
   movie             total_seen
   <chr>                  <int>
 1 avengers                   4
 2 super_man                  4
 3 birds_of_prey              3
 4 captain_america            3
 5 iron_man                   3
 6 love_actually              3
 7 rudolph                    3
 8 the_grinch                 3
 9 the_holiday                3
10 the_suicide_squad          3

Who saw the most movies?

by_viewer <- long_movie_data |> 
    group_by(viewer) |> 
    summarize(total_watched = sum(!is.na(rating))) |> 
    arrange(desc(total_watched))

print(by_viewer)

# A tibble: 6 × 2
  viewer total_watched
  <chr>          <int>
1 anne               9
2 james              7
3 cat                5
4 nico               5
5 jeff               4
6 greta              2

Superhero movies vs. holiday movies movies

#categorizing the  movies as superhero or holiday

avg_movie_categorized <- avg_movie %>%
    mutate(category = if_else(
        movie %in% c("the_grinch", "rudolph", "the_holiday", "love_actually"), 
        "Holiday", 
        "Superhero"
    ))

print(avg_movie_categorized)

# A tibble: 10 × 3
   movie             avg_rating category 
   <chr>                  <dbl> <chr>    
 1 birds_of_prey            4.7 Superhero
 2 love_actually            4   Holiday  
 3 the_suicide_squad        4   Superhero
 4 captain_america          3.7 Superhero
 5 rudolph                  3.3 Holiday  
 6 avengers                 3   Superhero
 7 iron_man                 2.7 Superhero
 8 super_man                2.5 Superhero
 9 the_holiday              2.3 Holiday  
10 the_grinch               2   Holiday

#plotting average ratings of categorized movies
#clearly villain movies are the favorite

ggplot(data = avg_movie_categorized, 
       aes(x = reorder(movie, avg_rating), y = avg_rating, fill = category)) +
    geom_col() +
    coord_flip() +
    # Manually set the colors for categories
    scale_fill_manual(values = c("Holiday" = "forestgreen", "Superhero" = "steelblue")) +
    labs(x = "Movie", y = "Average Rating", fill = "Category") +
    theme_minimal()

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed Feb 6, 2026.