** Please click all the tabs (in sequence) to get the entire set of information in these pages. **
** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
# install.packages("knitr")
library(knitr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice) # because the package is already installed
Dig into the code, learn how to write a function, and learn some new useful functions
Understand the meaning of the visualizations we did, and how to reproduce them
Identify interesting questions to ask about the topic you’re studying with data (here, movies) and how to answer those questions through data analysis
If you want to learn the fundamentals of R slowly and systematically, here is a very useful tutorial: https://www.guru99.com/r-data-types-operator.html. We’ll cherry-pick a few pieces and cover them below, but please review the tutorial on your own.
Here is another tutorial (https://www.statmethods.net/r-tutorial/index.html) that takes you quickly from the starting point (finding your way around the R environment, or the RStudio interface to R) to more complex things like defining and running functions.
Let’s go over a few basic things below.
x <- 5 # what does this do?
y <- x^2 # what would you expect as output
y # and now?
[1] 25
c(1,2,3,4,5) # creates a vector with these numbers
[1] 1 2 3 4 5
1:5 # compact way of doing the same vector
[1] 1 2 3 4 5
seq(1,5) ## seq is a function. it takes 2 or 3 arguments (first 2: from, to)
[1] 1 2 3 4 5
seq(1,5, by=2) # the third argument, "by", sets the step size between consecutive values
[1] 1 3 5
seq(1,2,length.out=3) # length.out sets how many evenly spaced values to produce
[1] 1.0 1.5 2.0
z <- seq(1,80000,400)
mean(z)
[1] 39801
sum(z)
[1] 7960200
# x + "y" # now what?
You have now seen a few functions that create vectors and perform mathematical operations on them.
There are many built-in functions in R (e.g., log, min, max, mean, sum), and many more are introduced through specialized packages (libraries). It is important to know the syntax for calling a function: the function name, followed by its arguments in parentheses.
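To make the calling syntax concrete, here is a small sketch that uses only built-in functions. Arguments can be matched by position or by name. The vector `v` and the names `s1` and `s2` are illustrative and not part of the session's data.

```r
# v is a hypothetical example vector
v <- c(4, 8, 15, 16, 23, 42)

# Positional matching: arguments are matched in the order the function defines them
s1 <- seq(1, 5, 2)

# Named matching: the same call, with each argument spelled out
s2 <- seq(from = 1, to = 5, by = 2)

identical(s1, s2)   # TRUE: both calls build the same vector

# Named arguments may appear in any order
seq(by = 2, to = 5, from = 1)

# Calls can be nested: the inner result becomes the argument of the outer call
round(log(max(v)), 2)
```

Naming arguments is slightly more typing, but it makes calls easier to read once a function takes more than two or three arguments.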
Let’s write a function, and for now this will be a “simple” mathematical function.
When you write a function, you need three things: a name, its arguments (or variables), and a body containing the formula (here, 2*n).
doubling <- function(n) {2*n} # name <- function(arg) {formula}
doubling(11)
[1] 22
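Because arithmetic in R is vectorized, a function like `doubling` also works on a whole vector without any extra code. The sketch below redefines the function so that it runs on its own; the input values are arbitrary.

```r
# Same definition as in the text, repeated so this snippet is self-contained
doubling <- function(n) {2*n}

# Applied to a vector, the formula is evaluated element by element
doubling(1:5)          # 2 4 6 8 10

# The result of one call can feed another call
doubling(doubling(3))  # 12
```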
Let’s write Einstein’s famous formula, E = mc^2
energy <- function(m, c) {m*c^2}
energy(10,3)
[1] 90
Let’s load the data by making a call to boxofficemojo.com through the boxoffice() function from the boxoffice package. If, for some reason, you have not yet installed the package, look through the Session 2 notes and install it.
date.seq <- paste(2000:2019,"-12-31",sep="")
# Fetch the data
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
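Before the request goes out, it can help to look at what `date.seq` contains. The sketch below needs no network access; it only rebuilds the date strings and converts them, and the name `dates` is introduced here for illustration.

```r
# paste() recycles the fixed string "-12-31" across all twenty years
date.seq <- paste(2000:2019, "-12-31", sep = "")
head(date.seq, 3)
length(date.seq)            # 20, one date per year

# as.Date() turns the character strings into proper Date objects
dates <- as.Date(date.seq)
class(dates)

# format() pulls pieces back out of a Date, here the four-digit year
format(dates[1], "%Y")

# paste0() is shorthand for paste(..., sep = "")
identical(date.seq, paste0(2000:2019, "-12-31"))
```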
We’ll extend the data frame by adding - for each movie in the database - Year, and Rank within Year based on gross revenues.
# na.omit() drops the rows with NA values; mutate() adds a Year column extracted from the date
movies <- movies %>% na.omit() %>% mutate(Year = as.numeric(format(as.Date(date), "%Y")))
# Within each Year, rank the movies by total gross (highest first)
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>% mutate(rank=row_number())
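To see how the grouping and ranking steps behave, here is a minimal sketch on a made-up data frame. The data frame `toy`, its values, and the name `ranked` are invented for illustration and are not Box Office Mojo data. Note that `arrange()` on grouped data sorts the whole table by default, while `row_number()` inside `mutate()` counts within each group, which is what produces a rank within each year.

```r
library(dplyr)

# toy is a hypothetical stand-in for the movies data frame
toy <- data.frame(
  movie       = c("A", "B", "C", "D", "E"),
  date        = c("2000-12-31", "2000-12-31", "2001-12-31", "2001-12-31", "2001-12-31"),
  total_gross = c(100, 250, 80, 300, 150)
)

ranked <- toy %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y"))) %>%  # extract the year
  group_by(Year) %>%
  arrange(desc(total_gross)) %>%   # highest gross first
  mutate(rank = row_number()) %>%  # position within each Year
  ungroup() %>%
  arrange(Year, rank)              # tidy ordering for display only

ranked
```

In this toy example, movie B ranks first in 2000 and movie D ranks first in 2001.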
One of the things we’d like to check is how many movies are included in the database, both overall and for each year.
table(movies$Year)
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  10   10   11   10    8   12   23   22   28   44   48   44
2012 2013 2014 2015 2016 2017 2018 2019
  46   41   38   42   48   40   50   47
We can also count how many movies each distributor company has in our data set.
table(movies$distributor)
1091 Media       20th Century     A24              Amazon Studi
1                71               8                1
Amazon Studios   Anchor Bay E     Annapurna Pi     Apparition
5                1                3                1
Argot Pictures   Atlas Distri     Aviron Pictures  Big Pictures
1                2                1                1
Bleecker Street  Broad Green      CBS Films        Dreamworks SKG
3                1                5                1
Elephant Eye     Entertainmen     EuropaCorp       FilmDistrict
1                1                2                4
Focus Features   Fox Searchlight  Freestyle Re     GKIDS
26               34               2                4
Greenwich        IFC Films        Lionsgate        Meadowbrook
4                1                27               1
MGM              Miramax          Miramax/Dime     Neon
4                7                1                5
New Line         Open Road        Orion Pictures   Overture Films
4                8                1                4
Paramount Pi     Paramount Va     Producers Di     Pure Flix En
60               6                1                3
Relativity       Reliance Big     Reliance Ent     Rialto Pictures
4                1                1                1
Roadside Att     Smith Global     Sony Pictures    SPEAKproduct
10               1                59               1
STX Entertai     Summit Enter     The Orchard      Universal
9                7                3                47
UTV Communic     Walt Disney      Warner Bros.     Weinstein Co.
3                38               100              20
So, now we can count the number of movies for which we have data in each Year (table(movies$Year)) and the number of times a distributor is featured in the data set (table(movies$distributor)), i.e., how many of its movies appear in our yearly top-50 lists.
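A long table of counts is easier to read once it is sorted. The sketch below shows one way to do that with base R, using a made-up vector named `distributors` as a stand-in for movies$distributor; the names and counts are invented for illustration.

```r
# distributors is a hypothetical vector standing in for movies$distributor
distributors <- c("Universal", "Warner Bros.", "Universal", "A24",
                  "Warner Bros.", "Warner Bros.")

# table() counts how often each distinct value occurs
counts <- table(distributors)
counts

# sort() orders the counts so the most frequent distributor comes first
sorted <- sort(counts, decreasing = TRUE)
sorted
names(sorted)[1]

# prop.table() converts the counts into shares of the total
prop.table(counts)
```

On the real data, the same pattern would be sort(table(movies$distributor), decreasing = TRUE).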