Assignment Instructions

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the ‘wide’ datasets identified in the Week 5 Discussion items. For each of the three chosen datasets:
    • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a ‘wide’ structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
    • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
    • Perform the analysis requested in the discussion item.
    • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset One - Movie Data

Load the Raw Data into R from CSV File

I am using some informal data collected from friends on 10 Sci-Fi Movies released in 2017 and how they would rate them on a 1-5 scale, 1 meaning they hated it and 5 meaning they loved it. A zero indicates that they did not see the movie.

The movie data was collected using a Google Form and the responses are stored in a Google Sheet which I then downloaded and saved as a csv file for import into R for simplicity. In the future I would like to try linking directly to Google Sheets using a googlesheets package that I recently heard about.

library(RCurl) 
## Loading required package: bitops
x <- getURL("https://raw.githubusercontent.com/betsyrosalen/DATA_607_Data_Acquisition_and_Management/master/Assignment2/Movie%20Ratings.csv") 
movies <-data.frame(read.csv(text=x, header=TRUE))
dim(movies)
## [1] 26 13
head(movies, 3)
##      Timestamp         Name Guardians.of.the.Galaxy.2 Wonder.Woman
## 1 2/5/18 16:13          Eva                         0            3
## 2 2/5/18 16:13 Mike Gilbert                         5            5
## 3 2/5/18 16:18         Gino                         5            5
##   Star.Wars..The.Last.Jedi Thor..Ragnarok Blade.Runner.2049
## 1                        4              0                 0
## 2                        5              5                 0
## 3                        4              5                 4
##   Spider.Man..Homecoming Alien..Covenant Ghost.in.the.Shell
## 1                      0               0                  0
## 2                      5               0                  0
## 3                      0               3                  0
##   War.for.the.Planet.of.the.Apes
## 1                              0
## 2                              0
## 3                              3
##   Valerian.and.the.City.of.a.Thousand.Planets Number.of.Movies.Seen
## 1                                           0                     2
## 2                                           0                     5
## 3                                           0                     7

Use tidyr to move header row to Movie column

The raw data is a ‘wide’ datatset with movie titles across the top and reviewer names running down the left side. To ‘tidy’ the data the movie titles should be moved to a column so that each combination of a reviewer and movie title is one observation. The tidyr gather function accomplishes this task.

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
tidymovies <- gather(movies, key = 'Movie', value = 'Rating', 3:12)
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
##        Timestamp              Name Number.of.Movies.Seen
## 207 2/7/18 17:24          Caroline                     0
## 250 2/5/18 22:15       Lorie Honor                     3
## 103 2/7/18 17:24          Caroline                     0
## 236 2/5/18 16:13      Mike Gilbert                     5
## 260 2/7/18 18:00             Brian                     5
## 114 2/5/18 18:07 Samantha Esposito                     4
## 129 2/7/18 17:24          Caroline                     0
## 130 2/7/18 18:00             Brian                     5
##                                           Movie Rating
## 207                          Ghost.in.the.Shell      0
## 250 Valerian.and.the.City.of.a.Thousand.Planets      0
## 103                              Thor..Ragnarok      0
## 236 Valerian.and.the.City.of.a.Thousand.Planets      0
## 260 Valerian.and.the.City.of.a.Thousand.Planets      0
## 114                           Blade.Runner.2049      0
## 129                           Blade.Runner.2049      0
## 130                           Blade.Runner.2049      5

Remove Timestamp and Number of Movies Seen columns

The timestamp is just a default of Google Forms that records when the form was submitted, but this info really is not relevant to our movie ratings at all, so I am removing it. Also the number of movies seen should be a function of the other data rather than listed repeatedly on each row, so I am removing that also.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# get rid of timestamp and Number.of.Movies.Seen columns...
tidymovies <- tidymovies[,-c(1,3)]
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
##                  Name                                       Movie Rating
## 192 Samantha Esposito                          Ghost.in.the.Shell      0
## 154        Jody Stoll                      Spider.Man..Homecoming      0
## 90            Tushar                               Thor..Ragnarok      5
## 65        Hugo Walker                    Star.Wars..The.Last.Jedi      0
## 194           Tushar                           Ghost.in.the.Shell      2
## 232        Jody Stoll              War.for.the.Planet.of.the.Apes      0
## 218 Samantha Esposito              War.for.the.Planet.of.the.Apes      0
## 235               Eva Valerian.and.the.City.of.a.Thousand.Planets      0

Change zeros to NA’s

My original Movie Ratings form specified that zero meant the reviewer had not seen the movie, so I want to change all zero’s to NA’s so that they can be easily excluded from calculations.

tidymovies$Rating[tidymovies$Rating == 0] <- NA
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
##                   Name                                       Movie Rating
## 67    Joseph Arminante                    Star.Wars..The.Last.Jedi     NA
## 115              Anon2                           Blade.Runner.2049     NA
## 255              Anon4 Valerian.and.the.City.of.a.Thousand.Planets     NA
## 95  Katharine rosalen                               Thor..Ragnarok     NA
## 253           Chris VA Valerian.and.the.City.of.a.Thousand.Planets      2
## 69  Katharine rosalen                     Star.Wars..The.Last.Jedi     NA
## 106       Mike Gilbert                           Blade.Runner.2049     NA
## 36   Samantha Esposito                                Wonder.Woman     NA

Use dplyr to split first and last names and then remove last names

tidymovies <- tidymovies %>%
  separate(Name, c("Reviewer", "Last"), sep=" ") %>%
  select(-Last)
## Warning: Expected 2 pieces. Additional pieces discarded in 20 rows [7, 17,
## 33, 43, 59, 69, 85, 95, 111, 121, 137, 147, 163, 173, 189, 199, 215, 225,
## 241, 251].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 150 rows
## [1, 3, 5, 6, 8, 9, 11, 14, 18, 20, 21, 22, 23, 25, 26, 27, 29, 31, 32,
## 34, ...].
dplyr::sample_frac(tidymovies, 0.03, replace=FALSE)
##     Reviewer                                       Movie Rating
## 192 Samantha                          Ghost.in.the.Shell     NA
## 9      Kerry                   Guardians.of.the.Galaxy.2      4
## 241     Ryan Valerian.and.the.City.of.a.Thousand.Planets     NA
## 210     Mike              War.for.the.Planet.of.the.Apes     NA
## 7       Ryan                   Guardians.of.the.Galaxy.2      3
## 48  Michelle                                Wonder.Woman      5
## 228    Anon3              War.for.the.Planet.of.the.Apes     NA
## 237     Gino Valerian.and.the.City.of.a.Thousand.Planets     NA

Clean up movie titles

For some reason googlesheets inserted periods in between every word in the Movie titles, so I want to replace all the periods in the movie names with a space to make them look cleaner.

library(stringr)
tidymovies$Movie <- str_replace_all(tidymovies$Movie, "\\.+", " ")
library(DT)
datatable(tidymovies)

A note about NA values

I thought about removing all the NA’s from my dataset, but decided not to do that since I had two respondents who didn’t see any of the movies and removing all lines with no rating would have removed those two reviewers from my dataset completely. I thought it was important to note that two of my friends hadn’t seen any of the 10 movies. Obviously they aren’t Sci-Fi fans!

Summary analysis with dplyr and ggplot2

First I want to see the average rating for each movie, but keeping in mind that some movies were only seen by a few of the respondents so their average rating may be biased by the lack of response.

avg_rating_by_movie <- tidymovies %>% 
  group_by(Movie) %>%
  filter(!is.na(Rating)) %>%
  summarise(Avg_Rating = round(mean(Rating, na.rm=TRUE),2), 
            Num_Reviews = n_distinct(Reviewer)) %>%
  arrange(desc(Avg_Rating))
datatable(avg_rating_by_movie)
avg_rating_by_movie %>% arrange(desc(Num_Reviews))
## # A tibble: 10 x 3
##    Movie                                       Avg_Rating Num_Reviews
##    <chr>                                            <dbl>       <int>
##  1 Wonder Woman                                      4.26          19
##  2 Star Wars The Last Jedi                           4.22          18
##  3 Guardians of the Galaxy 2                         3.94          18
##  4 Thor Ragnarok                                     4.50          12
##  5 Spider Man Homecoming                             3.73          11
##  6 Blade Runner 2049                                 4.00          10
##  7 War for the Planet of the Apes                    4.17           6
##  8 Ghost in the Shell                                3.60           5
##  9 Valerian and the City of a Thousand Planets       2.80           5
## 10 Alien Covenant                                    2.50           4

The numbers above confirm that the highest rated films are also the films with the highest viewing numbers. All 7 of the highest rated movies are the same 7 movies with the greatest number of reviewers. My guess is it’s likely the movies with low attendance numbers have low attendance because people pre-judge the films based on trailers and other advertisements.

Maybe viewers already know or believe they know with a high level of certainty whether they will like or not like a film before they even go see it and watching the film only confirms what they already thought anyway in most cases. If reviewers were not pre-judging films accurately, you wold expect there to be more low ratings where people thought they would like a film and then were disappointed. I guess in general people know what they like!

library(ggplot2)
ggplot(avg_rating_by_movie) +
  geom_point(mapping = aes(x=Avg_Rating, y=Movie, color=Num_Reviews))

There are very few low ratings (under 3) in the dataset at all. You can see below that there was only 1 rating of 1 in the entire dataset and 7 ratings of 2 out of 108 total reviews. I wonder if more reviewers had watched more of the films if there would be more low ratings? Or would they have liked the films if they saw them?

tidymovies %>% 
  filter(!is.na(Rating)) %>%
  summarise(n())
##   n()
## 1 108
tidymovies %>% 
  filter(!is.na(Rating)) %>%
  group_by(Rating) %>%
  summarise(n())
## # A tibble: 5 x 2
##   Rating `n()`
##    <int> <int>
## 1      1     1
## 2      2     7
## 3      3    26
## 4      4    33
## 5      5    41
hist(tidymovies$Rating, breaks = 15, xlim=c(1,5), col="blue")

The average rating for all movies in the dataset is 3.98

tidymovies %>% summarise(round(mean(Rating, na.rm=TRUE),2))
##   round(mean(Rating, na.rm = TRUE), 2)
## 1                                 3.98

You can see below that the reviewers who gave the highest average ratings were mostly those who saw very few films.

# Average rating given by each reviewer
avg_rating_by_reviewer <- tidymovies %>% 
  group_by(Reviewer) %>%
  filter(!is.na(Rating)) %>%
  summarise(Avg_Rating = round(mean(Rating, na.rm=TRUE),2),
            Num_Movies = n_distinct(Movie)) %>%
  arrange(desc(Avg_Rating))
avg_rating_by_reviewer
## # A tibble: 24 x 3
##    Reviewer Avg_Rating Num_Movies
##    <chr>         <dbl>      <int>
##  1 Diane          5.00          1
##  2 Michelle       5.00          4
##  3 Mike           5.00          5
##  4 Mercedes       4.57          7
##  5 Anon2          4.50          2
##  6 Anon4          4.50          2
##  7 Tushar         4.44          9
##  8 Cheri          4.25          4
##  9 Brian          4.20          5
## 10 Kerry          4.20          5
## # ... with 14 more rows
avg_rating_by_reviewer %>% arrange(desc(Num_Movies))
## # A tibble: 24 x 3
##    Reviewer Avg_Rating Num_Movies
##    <chr>         <dbl>      <int>
##  1 Betsy          3.70         10
##  2 Tushar         4.44          9
##  3 Chris          3.44          9
##  4 Mercedes       4.57          7
##  5 Gino           4.14          7
##  6 Adam           4.00          7
##  7 Jody           3.17          6
##  8 Mike           5.00          5
##  9 Brian          4.20          5
## 10 Kerry          4.20          5
## # ... with 14 more rows
library(ggplot2)
ggplot(avg_rating_by_reviewer) +
  geom_point(mapping = aes(x=Avg_Rating, y=Reviewer, color=Num_Movies))

Conclusion

I don’t think there’s really enough data here to say very much about these films. Especially because I think it is important to have more than a few ratings for each film. My sample size was way too small. Maybe it isn’t important for all reviewers to have seen all films, but I’d be curious if there would be more low ratings if they did.