** Please click all the tabs (in sequence) to get the entire set of information in these pages. **

** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
# install.packages("knitr")
library(knitr)

library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice) # assumes the package is already installed (see Session 2)

Session 3

Objectives

  • Dig into the code, learn how to write a function, and learn some new useful functions

  • Understand the meaning of the visualizations we did, and how to reproduce them

  • Identify interesting questions to ask about the topic you’re studying with data (here, movies) and how to answer those questions through data analysis

Refresher and Background

If you want to learn the fundamentals of R slowly and systematically, here is a very useful tutorial: https://www.guru99.com/r-data-types-operator.html. We’ll cherry pick bits and pieces and cover them below, but please review this on your own.

Here is another tutorial (https://www.statmethods.net/r-tutorial/index.html) that takes you quickly from the starting point - find your way around the R environment (or the RStudio interface to R) - to more complex things like defining and running functions.

Let’s go over a few basic things below.

x <- 5 # what does this do? 
y <- x^2 # what would you expect as output
y # and now? 
[1] 25
c(1,2,3,4,5) # creates a vector with these numbers
[1] 1 2 3 4 5
1:5 # compact way of doing the same vector
[1] 1 2 3 4 5
seq(1,5) # seq() is a function; its first two arguments are from and to
[1] 1 2 3 4 5
seq(1,5, by=2) # the optional argument "by" sets the step size between values
[1] 1 3 5
seq(1,2,length=3) # "length" (short for length.out) sets how many values to generate
[1] 1.0 1.5 2.0
z <- seq(1,80000,400)
mean(z)
[1] 39801
sum(z)
[1] 7960200
# x + "y" # now what? 

You have now seen a few functions that create vectors and perform mathematical operations on them.
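The commented-out line x + "y" is worth trying: R cannot add a number to a character string, so it stops with an error. The sketch below uses only base R to show that arithmetic on a vector is applied element by element, and it wraps the failing addition in tryCatch() so the script keeps running. The names v and result are illustrative and not part of the session code.

```r
v <- 1:5            # the vector 1 2 3 4 5
v * 2               # each element is doubled
v + v               # element-wise addition

x <- 5
# Adding a number to a character string is an error; tryCatch() captures
# the error message instead of stopping the script.
result <- tryCatch(x + "y", error = function(e) conditionMessage(e))
result
```

Here result holds the text of the error message rather than a number.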

Functions

R has many built-in functions (e.g., log, min, max, mean, sum), and many more are available through specialized packages (libraries). To use any of them, you need to know the syntax for calling a function: the function’s name followed by its arguments in parentheses.

Let’s write a function, and for now this will be a “simple” mathematical function.

To write a function you need three things: a name; its arguments (the input variables); and a body containing the formula to compute (here, 2*n).

doubling <- function(n) {2*n} # name <- function(arg) {formula}

doubling(11)
[1] 22
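Because arithmetic in R is vectorized, a function written for a single number often works on a whole vector without any changes. A minimal sketch, redefining doubling() so the block runs on its own:

```r
doubling <- function(n) {
  2 * n   # the value of the last expression is what the function returns
}

doubling(11)          # a single number
doubling(c(1, 2, 3))  # a vector: each element is doubled
doubling(1:80)[80]    # double the last element of 1:80
```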

Let’s write Einstein’s famous formula, E = mc^2

energy <- function(m, c) {m*c^2}
energy(10,3)
[1] 90
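Arguments can also be passed by name, and an argument can be given a default value. The variant below is only an illustration: the name energy2 and the default for c (the speed of light, 299,792,458 m/s) are assumptions, not part of the session code.

```r
# Hypothetical variant of energy() with a default value for c
energy2 <- function(m, c = 299792458) {
  m * c^2   # E = m * c^2; with m in kg and c in m/s the result is in joules
}

energy2(m = 10, c = 3)   # named arguments; same result as energy(10, 3)
energy2(c = 3, m = 10)   # named arguments may be given in any order
energy2(1)               # c falls back to its default value
```

Naming the arguments makes a call easier to read and protects against passing values in the wrong order.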

Movies Data Set

Let’s load the data by making a call to boxofficemojo.com through the boxoffice() function in the boxoffice package. If, for some reason, you have not yet installed the package, look through the Session 2 notes and install it.

date.seq <- paste(2000:2019,"-12-31",sep="") 
# Fetch the data 
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
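To see what the first line builds without contacting the website, you can inspect the date vector on its own. This sketch uses only base R and a shorter range of years; the name dates.demo is illustrative.

```r
# paste() is vectorized: each year is joined to the fixed suffix "-12-31"
dates.demo <- paste(2000:2002, "-12-31", sep = "")
dates.demo

# as.Date() converts the character strings to Date objects
class(as.Date(dates.demo))

# paste0() is shorthand for paste(..., sep = "")
identical(paste0(2000:2002, "-12-31"), dates.demo)
```

So date.seq holds one end-of-year date for each year from 2000 to 2019, and the top 50 movies are requested for each of those dates.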

We’ll extend the data frame by adding two columns for each movie: Year, and rank, the movie’s position within its Year based on total gross revenue.

# na.omit() drops rows that contain NA values; mutate() adds a new column, Year,
# holding the four-digit year ("%Y") extracted from the date
movies <- movies %>% na.omit() %>% mutate(Year = as.numeric(format(as.Date(date), "%Y")))

# Rank movies within each Year by total gross (rank 1 = highest gross)
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>% mutate(rank = row_number())
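The same two steps can be tried on a small made-up data frame, which makes it easier to check the result by eye. In this sketch the data frame toy, its titles and its gross values are invented for illustration; only the column names date and total_gross mirror the ones used above. It assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up data: titles and gross values are invented for illustration
toy <- data.frame(
  movie       = c("A", "B", "C", "D"),
  date        = c("2018-12-31", "2018-12-31", "2019-12-31", "2019-12-31"),
  total_gross = c(100, 300, 50, 200)
)

toy <- toy %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y"))) %>%  # "2018-12-31" -> 2018
  group_by(Year) %>%
  arrange(desc(total_gross)) %>%   # highest gross first
  mutate(rank = row_number()) %>%  # numbering restarts within each Year
  ungroup()

toy
```

Movies B and D each get rank 1 because each has the highest gross in its own year. The final ungroup() is optional; it removes the grouping so that later steps operate on the whole table.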

One of the things we’d like to check is how many movies the data set contains for each year. Because na.omit() dropped rows with missing values, a year can have fewer than 50 movies.

table(movies$Year)

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
  10   10   11   10    8   12   23   22   28   44   48   44 
2012 2013 2014 2015 2016 2017 2018 2019 
  46   41   38   42   48   40   50   47 

We can also count how many movies each distributor company has in our data set.


table(movies$distributor)

     1091 Media    20th Century             A24    Amazon Studi 
              1              71               8               1 
 Amazon Studios    Anchor Bay E    Annapurna Pi      Apparition 
              5               1               3               1 
 Argot Pictures    Atlas Distri Aviron Pictures    Big Pictures 
              1               2               1               1 
Bleecker Street    Broad Green        CBS Films  Dreamworks SKG 
              3               1               5               1 
   Elephant Eye    Entertainmen      EuropaCorp    FilmDistrict 
              1               1               2               4 
 Focus Features Fox Searchlight    Freestyle Re           GKIDS 
             26              34               2               4 
      Greenwich       IFC Films       Lionsgate     Meadowbrook 
              4               1              27               1 
            MGM         Miramax    Miramax/Dime            Neon 
              4               7               1               5 
       New Line       Open Road  Orion Pictures  Overture Films 
              4               8               1               4 
   Paramount Pi    Paramount Va    Producers Di    Pure Flix En 
             60               6               1               3 
     Relativity    Reliance Big    Reliance Ent Rialto Pictures 
              4               1               1               1 
   Roadside Att    Smith Global   Sony Pictures    SPEAKproduct 
             10               1              59               1 
   STX Entertai    Summit Enter     The Orchard       Universal 
              9               7               3              47 
   UTV Communic     Walt Disney    Warner Bros.   Weinstein Co. 
              3              38             100              20 

So, we can now count the number of movies for which we have data in each Year, using table(movies$Year), and the number of times each distributor appears in the data set, using table(movies$distributor) (i.e., how many of its movies were in a year’s top 50).
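A table of counts is easier to read when it is sorted. The sketch below uses base R only and a made-up vector of distributor names (Alpha, Beta and Gamma are invented); the same calls should work on table(movies$distributor).

```r
# Made-up distributor names, for illustration only
distributor <- c("Alpha", "Beta", "Alpha", "Gamma", "Alpha", "Beta")

counts <- table(distributor)               # alphabetical order by default
sorted <- sort(counts, decreasing = TRUE)  # most frequent first
sorted
head(sorted, 2)                            # the two most frequent distributors
```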
