Data Wrangling

Yesterday, we learned some of the basics of calculations, vectors and functions. Yesterday’s lab can be found online at http://rpubs.com/jfcross4/110739 and, if you would prefer to view today’s lab online, you can find it here: http://rpubs.com/jfcross4/110757.

Today, we’ll work on data wrangling, the art of manipulating data, and start graphing our data. Our data set comes from the Internet Movie Database (IMDb). RStudio’s data wrangling cheat sheet can be found here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf and its ggplot2 cheat sheet (for graphing data) is here: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Loading Packages

Packages contain groups of functions, typically with a common theme, that have been written and openly shared. To save space, I have installed a number of packages in a shared folder on the Saint Ann’s server. To access them, you’ll need to point R to the directory where these packages have been installed using the line below:

.libPaths("/home/rstudioshared")

Next, you need to tell R which packages you are going to use. The three packages we’ll likely use most often throughout the year are tidyr, dplyr and ggplot2. We’ll use dplyr and ggplot2 today. You can load these packages using the code below.

library(tidyr)
library(dplyr)
library(ggplot2)
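If one of the library() calls fails, it can help to check where R is looking for packages. The lines below are a minimal sketch using only base R functions; the variable names pkgs and available are just illustrative choices.

```r
# List the directories R currently searches for installed packages
.libPaths()

# Check whether each package can be found, without attaching it
pkgs <- c("tidyr", "dplyr", "ggplot2")
available <- sapply(pkgs, requireNamespace, quietly = TRUE)
available  # a named logical vector: TRUE means the package was found
```

A FALSE next to a package name suggests that the .libPaths line above was not run or that the package is not installed in that folder.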

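The movie data is contained within the ggplot2 package. (If you are working with ggplot2 version 2.0.0 or later, the data set has moved to the separate ggplot2movies package, so you will also need to run library(ggplot2movies).) Let’s take a look at the data. Try each of the following commands: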

?movies #brings up a description of the movies data frame
nrow(movies) #the number of rows (in this case the number of movies)
ncol(movies) #the number of columns or variables
summary(movies) #a summary of each column
View(movies) #brings up a sortable and filterable table

A lot can be done within View, where you can sort or filter by any variable. Here are a few questions you can answer:

How many PG rated movies are in this data set?
What is the longest PG movie? The shortest PG movie?
Which PG movie has the highest rating?
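Questions like these can also be explored with commands typed at the console rather than by clicking in View. The sketch below uses a tiny made-up data frame called toy (the titles and numbers are invented, not IMDb data) whose column names mirror three of the columns in movies.

```r
# A tiny made-up data frame; column names mirror those in movies
toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  length = c(90, 120, 75, 101),
  mpaa   = c("PG", "R", "PG", ""),
  stringsAsFactors = FALSE
)

head(toy, 2)      # the first two rows
str(toy)          # the type of each column
table(toy$mpaa)   # how many rows fall in each mpaa category
```

Replacing toy with movies should give the same kind of overview for the full data set.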

Filtering

It can also be quite useful to filter within your code and to assign a subset of your data to a new variable. The first two lines in the code below show two different ways of selecting R-rated movies. The third line assigns this subset to a variable named Rmovies and the final line takes a look at this variable.

filter(movies, mpaa=="R")
movies %>% filter(mpaa == "R")
Rmovies <- movies %>% filter(mpaa == "R")
View(Rmovies)
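Filters can combine more than one condition. The following is a minimal sketch on a made-up data frame (toy, long.R and pg.or.r are illustrative names, and the values are invented) showing two common patterns.

```r
library(dplyr)

toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  mpaa   = c("R", "PG", "R", "R"),
  length = c(95, 110, 130, 88),
  stringsAsFactors = FALSE
)

# Conditions separated by commas must all be true (same as joining them with &)
long.R <- toy %>% filter(mpaa == "R", length > 90)

# %in% keeps rows whose value matches any entry in the vector
pg.or.r <- toy %>% filter(mpaa %in% c("PG", "R"))
```

Here long.R keeps movies A and C, while pg.or.r keeps all four rows.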

Top N Lists

The first line below selects all R-rated movies, finds the top 5 by rating and returns them, arranged from highest rating to lowest. This list includes some movies that have very few votes and perhaps shouldn’t be in consideration for the best R-rated movie ever. The second line limits our search to movies with at least 5000 votes.

movies %>% filter(mpaa == "R") %>% top_n(5, rating) %>% arrange(desc(rating)) 
movies %>% filter(mpaa == "R" & votes>=5000) %>% top_n(5, rating) %>% arrange(desc(rating))
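One detail worth knowing is that top_n can return more than n rows when there are ties. The sketch below shows this on a made-up data frame (the titles and ratings are invented). Newer versions of dplyr (1.0.0 and later) describe top_n as superseded by slice_max, but top_n still works.

```r
library(dplyr)

toy <- data.frame(
  title  = c("A", "B", "C", "D"),
  rating = c(9, 8, 8, 7)
)

# Asking for the top 2 returns 3 rows because B and C are tied for second
top2 <- toy %>% top_n(2, rating) %>% arrange(desc(rating))
nrow(top2)
```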

The following code is one long instruction even though it runs across four lines. It tells R to group movies by their MPAA rating, find the average rating for each group and sort the groups from highest average rating to lowest. Note that some movies do not have an MPAA rating.

movies %>%
 group_by(mpaa) %>%
 summarise(avg.rating = mean(rating)) %>%
 arrange(desc(avg.rating))
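To see what this pipeline does step by step, it may help to run it on data small enough to check by hand. The sketch below uses a made-up data frame and assumes that unrated movies carry an empty string in the mpaa column, as they appear to in movies.

```r
library(dplyr)

toy <- data.frame(
  mpaa   = c("R", "R", "PG", "", ""),
  rating = c(6, 8, 5, 7, 9),
  stringsAsFactors = FALSE
)

by.mpaa <- toy %>%
  group_by(mpaa) %>%
  summarise(avg.rating = mean(rating)) %>%
  arrange(desc(avg.rating))
# "" (unrated) averages 8, R averages 7, PG averages 5
```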

In fact, it turns out that most movies don’t have an MPAA rating. The first command below returns the same table you found above along with the number of movies in each category. The second command finds the average rating, highest rating and number of movies by year.

movies %>%
 group_by(mpaa) %>%
 summarise(avg.rating = mean(rating), num.movies=length(rating)) %>%
 arrange(desc(avg.rating))

movies %>%
 group_by(year) %>%
 summarise(avg.rating = mean(rating), max.rating=max(rating), num.movies=length(rating)) %>%
 arrange(desc(year))
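The commands above count movies with length(rating). dplyr also provides n(), which counts the rows in each group without naming a column. A minimal sketch on made-up data (the years and ratings are invented):

```r
library(dplyr)

toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001, 2001),
  rating = c(6, 8, 5, 7, 9)
)

by.year <- toy %>%
  group_by(year) %>%
  summarise(avg.rating = mean(rating),
            max.rating = max(rating),
            num.movies = n()) %>%   # n() counts the rows in each group
  arrange(desc(year))
```

In this toy table, 2001 has three movies and 2000 has two.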

Let’s assign these summaries to variables and plot them using the ggplot2 package. The first piece of code below groups movies by year and assigns the average rating by year to a variable. The second piece graphs this grouped data, assigning year to the x-axis and average rating to the y-axis. The third piece draws the same graph and adds a smooth line fitted to the data.

rating.by.year <- movies %>%
 group_by(year) %>%
 summarise(avg.rating = mean(rating)) %>%
 arrange(desc(year))

ggplot(rating.by.year, aes(x=year, y=avg.rating))+geom_point()

ggplot(rating.by.year, aes(x=year, y=avg.rating))+geom_point()+geom_smooth()
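A plot can also be stored in a variable and built up one layer at a time. The sketch below uses made-up yearly averages (toy.by.year is an illustrative stand-in for rating.by.year) and adds axis labels and a title with labs.

```r
library(ggplot2)

# Made-up yearly averages standing in for rating.by.year
toy.by.year <- data.frame(
  year       = 2000:2005,
  avg.rating = c(5.8, 6.1, 5.9, 6.3, 6.0, 6.2)
)

# Build the plot one layer at a time and store it in a variable
p <- ggplot(toy.by.year, aes(x = year, y = avg.rating)) +
  geom_point() +
  labs(x = "Year", y = "Average rating", title = "Average rating by year")

p  # typing the variable name draws the plot
```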

We can group data by more than one variable. The first batch of code below groups the data by year and MPAA category, and the second batch graphs this data with points colored by MPAA rating.

rating.by.year.mpaa <- movies %>%
 group_by(year, mpaa) %>%
 summarise(avg.rating = mean(rating)) %>%
 arrange(desc(year))

ggplot(rating.by.year.mpaa, aes(x=year, y=avg.rating, color=mpaa))+geom_point()
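Two optional refinements are sketched below on made-up data: dropping movies with no MPAA rating before grouping, and asking summarise to return an ungrouped table. This assumes that unrated movies are stored with an empty string, and the .groups argument assumes dplyr version 1.0.0 or later.

```r
library(dplyr)

toy <- data.frame(
  year   = c(2000, 2000, 2000, 2001, 2001),
  mpaa   = c("R", "R", "", "PG", "R"),
  rating = c(6, 8, 7, 5, 9),
  stringsAsFactors = FALSE
)

rated.by.year.mpaa <- toy %>%
  filter(mpaa != "") %>%                # drop movies with no MPAA rating
  group_by(year, mpaa) %>%
  summarise(avg.rating = mean(rating),
            .groups = "drop") %>%       # return an ungrouped table
  arrange(desc(year), mpaa)
```

Without .groups = "drop", summarise removes only the last grouping variable, so the result would still be grouped by year.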

Going back to the original data set, here’s a boxplot of ratings by MPAA category. Which MPAA rating has the highest median rating?

ggplot(movies, aes(mpaa, rating))+geom_boxplot()
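An answer read off a boxplot can be double-checked by computing the medians directly. The sketch below does this on a made-up data frame (the ratings are invented), and the same pipeline could be pointed at movies.

```r
library(dplyr)

toy <- data.frame(
  mpaa   = c("R", "R", "R", "PG", "PG", "PG"),
  rating = c(4, 6, 9, 5, 7, 8),
  stringsAsFactors = FALSE
)

# The thick line inside each box is that group's median
medians <- toy %>%
  group_by(mpaa) %>%
  summarise(median.rating = median(rating)) %>%
  arrange(desc(median.rating))
```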

Homework for Thursday:

Revisit yesterday’s code (http://rpubs.com/jfcross4/110739) and, in particular, look at the section on writing functions. Write your own function (it can be as simple or complex as you like) and use it. You can print out your code and example or simply send it to me in an email. Here are two examples of what this might look like:

#Example 1: Counting Combinations
CHOOSE <- function(a,b){ factorial(a)/(factorial(b)*factorial(a-b))}
CHOOSE(4,2)
#This function calculates the number of ways to choose b items from a possible choices.  Ex: There are 6 ways to choose 2 items from 4 choices.

#Example 2: Is a number divisible by 3?
DIV3 <- function(x){x %% 3 == 0}
DIV3(1975642) #returns FALSE because 1975642 is not divisible by 3
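One way to gain confidence in a function you have written is to test it on cases you can work out by hand or compare it with a built-in function. The sketch below re-creates the two example functions and checks them. choose() is R's own built-in function for counting combinations.

```r
# Re-create the two example functions
CHOOSE <- function(a, b) { factorial(a) / (factorial(b) * factorial(a - b)) }
DIV3 <- function(x) { x %% 3 == 0 }

# Compare CHOOSE with R's built-in choose() on a case we can do by hand
CHOOSE(4, 2) == choose(4, 2)   # both give 6

# Because %% works element by element, DIV3 accepts a whole vector at once
DIV3(c(3, 4, 9))               # TRUE FALSE TRUE
```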

Probability and Statistics

9/21/2015