The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
I am using some informal data collected from friends on 10 Sci-Fi Movies released in 2017 and how they would rate them on a 1-5 scale, 1 meaning they hated it and 5 meaning they loved it. A zero indicates that they did not see the movie.
The movie data was collected using a Google Form and the responses are stored in a Google Sheet which I then downloaded and saved as a csv file for import into R for simplicity. In the future I would like to try linking directly to Google Sheets using a googlesheets package that I recently heard about.
library(RCurl)
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Assignment2/Movie%20Ratings.csv")
movies <-data.frame(read.csv(text=x, header=TRUE))
dim(movies)
## [1] 26 13
head(movies, 3)
## Timestamp Name Guardians.of.the.Galaxy.2 Wonder.Woman
## 1 2/5/18 16:13 Eva 0 3
## 2 2/5/18 16:13 Mike Gilbert 5 5
## 3 2/5/18 16:18 Gino 5 5
## Star.Wars..The.Last.Jedi Thor..Ragnarok Blade.Runner.2049
## 1 4 0 0
## 2 5 5 0
## 3 4 5 4
## Spider.Man..Homecoming Alien..Covenant Ghost.in.the.Shell
## 1 0 0 0
## 2 5 0 0
## 3 0 3 0
## War.for.the.Planet.of.the.Apes
## 1 0
## 2 0
## 3 3
## Valerian.and.the.City.of.a.Thousand.Planets Number.of.Movies.Seen
## 1 0 2
## 2 0 5
## 3 0 7
The raw data is a ‘wide’ datatset with movie titles across the top and reviewer names running down the left side. To ‘tidy’ the data the movie titles should be moved to a column so that each combination of a reviewer and movie title is one observation. The tidyr gather function accomplishes this task.
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
tidymovies <- gather(movies, key = 'Movie', value = 'Rating', 3:12)
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
## Timestamp Name Number.of.Movies.Seen
## 207 2/7/18 17:24 Caroline 0
## 250 2/5/18 22:15 Lorie Honor 3
## 103 2/7/18 17:24 Caroline 0
## 236 2/5/18 16:13 Mike Gilbert 5
## 260 2/7/18 18:00 Brian 5
## 114 2/5/18 18:07 Samantha Esposito 4
## 129 2/7/18 17:24 Caroline 0
## 130 2/7/18 18:00 Brian 5
## Movie Rating
## 207 Ghost.in.the.Shell 0
## 250 Valerian.and.the.City.of.a.Thousand.Planets 0
## 103 Thor..Ragnarok 0
## 236 Valerian.and.the.City.of.a.Thousand.Planets 0
## 260 Valerian.and.the.City.of.a.Thousand.Planets 0
## 114 Blade.Runner.2049 0
## 129 Blade.Runner.2049 0
## 130 Blade.Runner.2049 5
The timestamp is just a default of Google Forms that records when the form was submitted, but this info really is not relevant to our movie ratings at all, so I am removing it. Also the number of movies seen should be a function of the other data rather than listed repeatedly on each row, so I am removing that also.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# get rid of timestamp and Number.of.Movies.Seen columns...
tidymovies <- tidymovies[,-c(1,3)]
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
## Name Movie Rating
## 192 Samantha Esposito Ghost.in.the.Shell 0
## 154 Jody Stoll Spider.Man..Homecoming 0
## 90 Tushar Thor..Ragnarok 5
## 65 Hugo Walker Star.Wars..The.Last.Jedi 0
## 194 Tushar Ghost.in.the.Shell 2
## 232 Jody Stoll War.for.the.Planet.of.the.Apes 0
## 218 Samantha Esposito War.for.the.Planet.of.the.Apes 0
## 235 Eva Valerian.and.the.City.of.a.Thousand.Planets 0
My original Movie Ratings form specified that zero meant the reviewer had not seen the movie, so I want to change all zero’s to NA’s so that they can be easily excluded from calculations.
tidymovies$Rating[tidymovies$Rating == 0] <- NA
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
## Name Movie Rating
## 67 Joseph Arminante Star.Wars..The.Last.Jedi NA
## 115 Anon2 Blade.Runner.2049 NA
## 255 Anon4 Valerian.and.the.City.of.a.Thousand.Planets NA
## 95 Katharine rosalen Thor..Ragnarok NA
## 253 Chris VA Valerian.and.the.City.of.a.Thousand.Planets 2
## 69 Katharine rosalen Star.Wars..The.Last.Jedi NA
## 106 Mike Gilbert Blade.Runner.2049 NA
## 36 Samantha Esposito Wonder.Woman NA
tidymovies <- tidymovies %>%
separate(Name, c("Reviewer", "Last"), sep=" ") %>%
select(-Last)
## Warning: Expected 2 pieces. Additional pieces discarded in 20 rows [7, 17,
## 33, 43, 59, 69, 85, 95, 111, 121, 137, 147, 163, 173, 189, 199, 215, 225,
## 241, 251].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 150 rows
## [1, 3, 5, 6, 8, 9, 11, 14, 18, 20, 21, 22, 23, 25, 26, 27, 29, 31, 32,
## 34, ...].
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
## Reviewer Movie Rating
## 192 Samantha Ghost.in.the.Shell NA
## 9 Kerry Guardians.of.the.Galaxy.2 4
## 241 Ryan Valerian.and.the.City.of.a.Thousand.Planets NA
## 210 Mike War.for.the.Planet.of.the.Apes NA
## 7 Ryan Guardians.of.the.Galaxy.2 3
## 48 Michelle Wonder.Woman 5
## 228 Anon3 War.for.the.Planet.of.the.Apes NA
## 237 Gino Valerian.and.the.City.of.a.Thousand.Planets NA
For some reason googlesheets inserted periods in between every word in the Movie titles, so I want to replace all the periods in the movie names with a space to make them look cleaner.
library(stringr)
tidymovies$Movie <- str_replace_all(tidymovies$Movie, "\\.+", " ")
library(DT)
datatable(tidymovies)
I thought about removing all the NA’s from my dataset, but decided not to do that since I had two respondents who didn’t see any of the movies and removing all lines with no rating would have removed those two reviewers from my dataset completely. I thought it was important to note that two of my friends hadn’t seen any of the 10 movies. Obviously they aren’t Sci-Fi fans!
First I want to see the average rating for each movie, but keeping in mind that some movies were only seen by a few of the respondents so their average rating may be biased by the lack of response.
avg_rating_by_movie <- tidymovies %>%
group_by(Movie) %>%
filter(!is.na(Rating)) %>%
summarise(Avg_Rating = round(mean(Rating, na.rm=TRUE),2),
Num_Reviews = n_distinct(Reviewer)) %>%
arrange(desc(Avg_Rating))
datatable(avg_rating_by_movie)
avg_rating_by_movie %>% arrange(desc(Num_Reviews))
## # A tibble: 10 x 3
## Movie Avg_Rating Num_Reviews
## <chr> <dbl> <int>
## 1 Wonder Woman 4.26 19
## 2 Star Wars The Last Jedi 4.22 18
## 3 Guardians of the Galaxy 2 3.94 18
## 4 Thor Ragnarok 4.50 12
## 5 Spider Man Homecoming 3.73 11
## 6 Blade Runner 2049 4.00 10
## 7 War for the Planet of the Apes 4.17 6
## 8 Ghost in the Shell 3.60 5
## 9 Valerian and the City of a Thousand Planets 2.80 5
## 10 Alien Covenant 2.50 4
The numbers above confirm that the highest rated films are also the films with the highest viewing numbers. All 7 of the highest rated movies are the same 7 movies with the greatest number of reviewers. My guess is it’s likely the movies with low attendance numbers have low attendance because people pre-judge the films based on trailers and other advertisements.
Maybe viewers already know or believe they know with a high level of certainty whether they will like or not like a film before they even go see it and watching the film only confirms what they already thought anyway in most cases. If reviewers were not pre-judging films accurately, you wold expect there to be more low ratings where people thought they would like a film and then were disappointed. I guess in general people know what they like!
library(ggplot2)
ggplot(avg_rating_by_movie) +
geom_point(mapping = aes(x=Avg_Rating, y=Movie, color=Num_Reviews))
There are very few low ratings (under 3) in the dataset at all. You can see below that there was only 1 rating of 1 in the entire dataset and 7 ratings of 2 out of 108 total reviews. I wonder if more reviewers had watched more of the films if there would be more low ratings? Or would they have liked the films if they saw them?
tidymovies %>%
filter(!is.na(Rating)) %>%
summarise(n())
## n()
## 1 108
tidymovies %>%
filter(!is.na(Rating)) %>%
group_by(Rating) %>%
summarise(n())
## # A tibble: 5 x 2
## Rating `n()`
## <int> <int>
## 1 1 1
## 2 2 7
## 3 3 26
## 4 4 33
## 5 5 41
hist(tidymovies$Rating, breaks = 15, xlim=c(1,5), col="blue")
The average rating for all movies in the dataset is 3.98
tidymovies %>% summarise(round(mean(Rating, na.rm=TRUE),2))
## round(mean(Rating, na.rm = TRUE), 2)
## 1 3.98
You can see below that the reviewers who gave the highest average ratings were mostly those who saw very few films.
# Average rating given by each reviewer
avg_rating_by_reviewer <- tidymovies %>%
group_by(Reviewer) %>%
filter(!is.na(Rating)) %>%
summarise(Avg_Rating = round(mean(Rating, na.rm=TRUE),2),
Num_Movies = n_distinct(Movie)) %>%
arrange(desc(Avg_Rating))
avg_rating_by_reviewer
## # A tibble: 24 x 3
## Reviewer Avg_Rating Num_Movies
## <chr> <dbl> <int>
## 1 Diane 5.00 1
## 2 Michelle 5.00 4
## 3 Mike 5.00 5
## 4 Mercedes 4.57 7
## 5 Anon2 4.50 2
## 6 Anon4 4.50 2
## 7 Tushar 4.44 9
## 8 Cheri 4.25 4
## 9 Brian 4.20 5
## 10 Kerry 4.20 5
## # ... with 14 more rows
avg_rating_by_reviewer %>% arrange(desc(Num_Movies))
## # A tibble: 24 x 3
## Reviewer Avg_Rating Num_Movies
## <chr> <dbl> <int>
## 1 Betsy 3.70 10
## 2 Tushar 4.44 9
## 3 Chris 3.44 9
## 4 Mercedes 4.57 7
## 5 Gino 4.14 7
## 6 Adam 4.00 7
## 7 Jody 3.17 6
## 8 Mike 5.00 5
## 9 Brian 4.20 5
## 10 Kerry 4.20 5
## # ... with 14 more rows
library(ggplot2)
ggplot(avg_rating_by_reviewer) +
geom_point(mapping = aes(x=Avg_Rating, y=Reviewer, color=Num_Movies))
I don’t think there’s really enough data here to say very much about these films. Especially because I think it is important to have more than a few ratings for each film. My sample size was way too small. Maybe it isn’t important for all reviewers to have seen all films, but I’d be curious if there would be more low ratings if they did.