Data Wrangling II

Log in: http://rstudio.saintannsny.org:8787/

Labs:

Lab 1 (functions): http://rpubs.com/jfcross4/110739

Lab 2: (Data Wrangling: dplyr): http://rpubs.com/jfcross4/110757

Lab 3: (Data Wrangling: tidyr): http://rpubs.com/jfcross4/111819

Cheatsheets:

dplyr and tidyr: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

ggplot2: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Loading Packages (once again)

.libPaths("/home/rstudioshared")
library(tidyr)
library(dplyr)
library(ggplot2)

library(drm) # this has today's data which also relates to movies

First, let’s take a look at the data:

?movie # brings up a description of the movie review data frame
View(movie)

Challenges using dplyr

Using the dplyr ideas you saw in the last lab (you might want to look over that code), try to:

Find the average ratings by reviewer.
Find the average ratings by movie and order movies from highest to lowest average rating.
Find all the movies that had positive reviews from all four reviewers.
Find all movies that have two positive ratings and two negative ratings (and no mixed ratings).
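
As a reminder of the general pattern from the last lab, here is a minimal sketch on a made-up data set. The data frame toy and its columns (snack, taster, score) are hypothetical and are not part of the movie data; the challenges above are still yours to solve.

```r
library(dplyr)

# Hypothetical toy data: three snacks, each rated by two tasters
toy <- data.frame(
  snack  = c("apple", "apple", "chips", "chips", "fudge", "fudge"),
  taster = c("ann", "bob", "ann", "bob", "ann", "bob"),
  score  = c(3, 2, 1, 1, 3, 3)
)

# Average score per snack, sorted from highest to lowest average
snack.avg <- toy %>%
  group_by(snack) %>%
  summarise(avg.score = mean(score)) %>%
  arrange(desc(avg.score))

snack.avg
```

The same three steps (group, summarise, arrange) should carry over to the movie data once you swap in its column names.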

Just for fun, let’s make a bar graph of each reviewer’s ratings.

# geom_bar() counts the rows at each discrete rating; geom_histogram() expects a continuous x
ggplot(movie, aes(x=as.factor(y),fill=critic))+geom_bar()+facet_grid(~critic)

How would you compare these four reviewers?

Introduction to tidyr

tidyr can be used to take your data and squeeze it or stretch it into a different configuration. The movie data currently has one row for every movie rating, which could be seen as four rows for every movie (one from each reviewer) or 93 rows for each reviewer (one for each movie). Use the following lines of code to rearrange your data so that you have, first, one row for each reviewer with 93 ratings in each row and, second, one row for each movie with 4 ratings in each row:

movie %>% spread(movie, y)
movie %>% spread(critic, y)

Depending on the question you are trying to answer, one of these formats may turn out to be more useful than your original data format.
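
To see what spread() is doing, it may help to try it on a data set small enough to read at a glance. The sketch below uses a hypothetical data frame toy (columns snack, taster and score are made up for illustration) and reshapes it both ways.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data in "long" format: one row per rating
toy <- data.frame(
  snack  = c("apple", "apple", "chips", "chips"),
  taster = c("ann", "bob", "ann", "bob"),
  score  = c(3, 2, 1, 1)
)

# One row per snack, one column per taster
wide.by.snack <- toy %>% spread(taster, score)

# One row per taster, one column per snack
wide.by.taster <- toy %>% spread(snack, score)

wide.by.snack
wide.by.taster
```

The first argument to spread() names the column whose values become the new column names, and the second names the column that fills the cells. In recent versions of tidyr, spread() is superseded by pivot_wider(), though spread() still works.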

The following lines of code take the second format (one row for each movie) and use it to answer two questions for each movie: Does Siskel have a higher rating than Ebert? Do all four movie reviewers agree on the rating? In the latter case, the code returns only those movies where all reviewers gave the same rating.

movie %>% spread(critic, y) %>% mutate(siskel.higher=(siskel>ebert))

movie %>% spread(critic, y) %>% mutate(all.agree=(siskel==ebert & ebert==medved & medved==lyons)) %>% 
  filter(all.agree==TRUE)

You can write a function that determines the amount of disagreement between two ratings, x and y. Create your function and then run the code below to calculate the amount of disagreement between Siskel and Ebert on every movie and list the movies in order from most to least disagreement.

pair.disagreement <- function(x, y){ 
  # disagreement function here
  }

movie %>% spread(critic, y) %>% group_by(movie) %>% 
  summarise(disagreement=pair.disagreement(siskel, ebert)) %>% 
  arrange(desc(disagreement))
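
If you are unsure where to start, here is one possible way to measure disagreement, shown on made-up data. The function name abs.gap and the data frame toy.wide are hypothetical; the absolute difference is only one reasonable choice, and you may prefer a different definition.

```r
library(dplyr)

# One possible definition (an assumption, not the required answer):
# disagreement is the absolute difference between the two ratings
abs.gap <- function(x, y) {
  abs(x - y)
}

# Hypothetical wide data: one row per snack, one column per taster
toy.wide <- data.frame(
  snack = c("apple", "chips", "fudge"),
  ann   = c(3, 1, 3),
  bob   = c(1, 1, 2)
)

# Same shape as the pipeline above: one disagreement value per snack
gaps <- toy.wide %>%
  group_by(snack) %>%
  summarise(disagreement = abs.gap(ann, bob)) %>%
  arrange(desc(disagreement))

gaps
```

With this definition, two identical ratings give 0, and a 3 against a 1 gives the largest possible value of 2.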

You might also be interested in the most total disagreement. You can write a function that returns the amount of disagreement between four ratings (a, b, c and d) given the ratings and the average of the ratings and then use the code below to find the movies with the most total disagreement.

four.way.disagreement <- function(a,b,c,d, avg) {
  # code here
}

# the most disagreed upon movies are...

movie %>% spread(critic, y) %>% mutate(avg.review = (siskel+ebert+medved+lyons)/4) %>% 
  group_by(movie) %>% 
  summarise(disagreement=four.way.disagreement(siskel, ebert, medved, lyons, avg.review)) %>% 
  arrange(desc(disagreement))
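
As a sketch of what such a function could look like, the example below adds up how far each of four ratings sits from their average. The function name spread.from.avg, the data frame toy.wide and its columns r1 to r4 are all hypothetical, and summing absolute deviations is an assumption; squared deviations would be another reasonable choice.

```r
library(dplyr)

# One possible definition (an assumption, not the required answer):
# total absolute distance of the four ratings from their average
spread.from.avg <- function(a, b, c, d, avg) {
  abs(a - avg) + abs(b - avg) + abs(c - avg) + abs(d - avg)
}

# Hypothetical wide data: one row per snack, four raters r1..r4
toy.wide <- data.frame(
  snack = c("apple", "chips"),
  r1 = c(3, 2),
  r2 = c(3, 2),
  r3 = c(1, 2),
  r4 = c(1, 2)
)

total.gaps <- toy.wide %>%
  mutate(avg = (r1 + r2 + r3 + r4) / 4) %>%
  group_by(snack) %>%
  summarise(disagreement = spread.from.avg(r1, r2, r3, r4, avg)) %>%
  arrange(desc(disagreement))

total.gaps
```

Here the raters split two against two on "apple", so every rating is 1 away from the average of 2, while all four agree on "chips".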

Finally, you might want to visualize how often Siskel and Ebert agree and disagree. Try running the following (admittedly overly complicated) code:

movie.spread <- movie %>% spread(critic, y)
p <- ggplot(movie.spread, aes(x=as.factor(siskel), y=as.factor(ebert)))+
  geom_bin2d(aes(fill=..count..),bins=3) + scale_fill_gradient(
    low = "white",
    high = "red")
se.table <- movie %>% spread(critic, y) %>% group_by(siskel, ebert) %>% summarize(n=length(movie))
p+ geom_text(data=se.table, aes(x=siskel, y=ebert, label=n), col="black")
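
If you only want the counts behind a plot like this, a cross-tabulation in base R is a simpler route. The sketch below uses a hypothetical data frame toy.wide with two made-up raters, ann and bob, rather than the movie data.

```r
# Hypothetical wide data: five items rated by two raters on a 1-3 scale
toy.wide <- data.frame(
  ann = c(3, 1, 3, 2, 3),
  bob = c(3, 1, 2, 2, 3)
)

# Counts for every combination of ann's rating (rows) and bob's rating (columns)
agreement.table <- with(toy.wide, table(ann, bob))
agreement.table

# Share of items on which the two raters gave the same rating
agreement.rate <- mean(toy.wide$ann == toy.wide$bob)
agreement.rate
```

The diagonal of the table holds the items where the two raters agree, which is the same information the heat map shows with color.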

Probability and Statistics

September 24, 2015