Project 2

Author

Samantha Barbaro

Approach

I’ve created a table with movie rating data from six definitely real people in SQL. I will create a tidy data frame that reads nulls correctly into R.

I will use the tidyverse to inspect the data, find out how many missing values there are, and find general descriptive statistics (potentially breaking out different facets of data, for example, just superhero movies). I’ll find out which movie(s) were seen the most/least, as well as the average rating for each movie. I will also define my approach for null values in this data set (for example, while it might be okay to exclude null values when calculating the average rating for each movie, is this approach acceptable for calculating the average rating for superhero movies in general?).

Data set:

https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv

SQL queries that created the data set are available here:

https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/Project_2_queries.sql

Loading the data set

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

sql_url <- "https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/moviedata_CSV.csv"

#read and convert blanks and NULL to NA

movie_data <- read.csv(sql_url, na = c("", "NA", "null", "NULL"))

glimpse(movie_data)

Rows: 6
Columns: 11
$ viewer            <chr> "anne", "cat", "greta", "james", "jeff", "nico"
$ iron_man          <int> 1, NA, NA, 2, NA, 5
$ captain_america   <int> 2, NA, 4, 5, NA, NA
$ avengers          <int> 3, NA, 3, NA, 4, 2
$ super_man         <int> NA, 5, NA, 2, 1, 2
$ birds_of_prey     <int> 5, 4, NA, NA, NA, 5
$ the_suicide_squad <int> 4, NA, NA, 4, NA, 4
$ the_grinch        <int> 1, 2, NA, 3, NA, NA
$ the_holiday       <int> 1, 4, NA, 2, NA, NA
$ love_actually     <int> 2, NA, NA, 5, 5, NA
$ rudolph           <int> 1, 4, NA, NA, 5, NA

Converting to a tibble

#converting to tibble

movie_data <- as_tibble(movie_data) 

view(movie_data)

Pivot longer

#this may be required to analyze the data (making the data into a 2-column format)