Business Analytics Lab Worksheet 03

About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this lab, we will explore the Rotten Tomatoes movie data set and understand it better through simple statistics.

Setup

Download the ‘bsad_lab03’ zip folder and extract it. Next, set the extracted folder as the working directory: open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.

Task 1

First, we begin by reading in the data from the ‘rottentomatoes.csv’ file, and viewing it to make sure we see it being read in correctly.

mydata = read.csv(file="data/rottentomatoes.csv")
head(mydata)

##                                        title
## 1                                     Avatar
## 2   Pirates of the Caribbean: At World's End
## 3                                    Spectre
## 4                      The Dark Knight Rises
## 5 Star Wars: Episode VII - The Force Awakens
## 6                                John Carter
##                            genres          director          actor1
## 1 Action|Adventure|Fantasy|Sci-Fi     James Cameron     CCH Pounder
## 2        Action|Adventure|Fantasy    Gore Verbinski     Johnny Depp
## 3       Action|Adventure|Thriller        Sam Mendes Christoph Waltz
## 4                 Action|Thriller Christopher Nolan       Tom Hardy
## 5                     Documentary       Doug Walker     Doug Walker
## 6         Action|Adventure|Sci-Fi    Andrew Stanton    Daryl Sabara
##             actor2               actor3 length    budget director_fb_likes
## 1 Joel David Moore            Wes Studi    178 237000000                 0
## 2    Orlando Bloom       Jack Davenport    169 300000000               563
## 3     Rory Kinnear     Stephanie Sigman    148 245000000                 0
## 4   Christian Bale Joseph Gordon-Levitt    164 250000000             22000
## 5       Rob Walker                          NA        NA               131
## 6  Samantha Morton         Polly Walker    132 263700000               475
##   actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1            1000             936             855             4834
## 2           40000            5000            1000            48350
## 3           11000             393             161            11700
## 4           27000           23000           23000           106759
## 5             131              12              NA              143
## 6             640             632             530             1873
##   fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1    33000            723          3054      886204   7.9         1.78
## 2        0            302          1238      471220   7.1         2.35
## 3    85000            602           994      275868   6.8         2.35
## 4   164000            813          2701     1144337   8.5         2.35
## 5        0             NA            NA           8   7.1           NA
## 6    24000            462           738      212204   6.6         2.35
##       gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5        NA   NA
## 6  73058679 2012
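Text files exported from spreadsheet programs sometimes begin with a byte-order mark (BOM), which can show up as stray characters in the first column name (for example `ï..title`). The sketch below is one hedged way to guard against that. It writes a small hypothetical CSV with a BOM to a temporary file, so it does not touch the lab data, and then reads it back with `fileEncoding = "UTF-8-BOM"`. Whether the lab file needs this depends on how it was saved.

```r
# Hypothetical two-row file written with a UTF-8 byte-order mark (BOM),
# standing in for a CSV exported from a spreadsheet program.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("title,gross\nAvatar,760505847\nSpectre,200074175\n")
writeBin(c(bom, body), tmp)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM while reading
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(clean)

# trimws() strips ordinary leading and trailing whitespace from the titles
clean$title <- trimws(clean$title)
```

If the column names come back clean, the same `fileEncoding` argument could be tried on the `read.csv` call for the lab file.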

Now calculate the Range, Min, Max, Mean, STDEV, and Variance for each variable. Below is an example of how to compute the items for the variable ‘gross’. Follow the example and do the same for user votes, total cast likes, director facebook likes, and critic reviews.

Gross

gross = mydata$gross
#Max Gross
max_gross = max(gross,na.rm = TRUE)
max_gross

## [1] 760505847

which.max(gross)

## [1] 1

#Min Gross
min_gross = min(gross,na.rm = TRUE)
min_gross

## [1] 162

which.min(gross)

## [1] 3331

#Range
max_gross - min_gross

## [1] 760505685

#Mean
mean_gross = mean(gross,na.rm = TRUE)
mean_gross

## [1] 48468408

#Standard Deviation
sd_gross = sd(gross,na.rm = TRUE)
sd_gross

## [1] 68452990

#Variance
var_gross = var(gross,na.rm = TRUE)
var_gross

## [1] 4.685812e+15

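Here, the maximum gross among all of the movies is 760.51 million dollars, located in the first row, which is Avatar. The movie that earned the least is Skin Trade, located in the 3,331st row, grossing only 162 dollars. The range for gross is 760,505,685, the mean is 48,468,408, the standard deviation is 68,452,990, and the variance is about 4.69e+15.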
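Because the same six statistics are repeated for every variable, they can be wrapped in one small function. The sketch below is only an illustration: `describe_var` is a hypothetical helper name, not part of the lab files, and `toy` is a made-up vector with one missing value.

```r
# describe_var() is a hypothetical helper; it bundles the statistics
# computed step by step in this task and ignores missing values.
describe_var <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  c(min = lo,
    max = hi,
    range = hi - lo,
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    var = var(x, na.rm = TRUE))
}

# Made-up values with one NA, standing in for a column such as mydata$gross
toy <- c(162, 5000, NA, 760505847)
stats <- describe_var(toy)
stats

# which.max() and which.min() skip NA values and return the row position
which.max(toy)
which.min(toy)
```

With the lab data loaded, a call such as `describe_var(mydata$users_votes)` would return the same quantities in one step.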

User Votes

users_votes = mydata$users_votes
#Max User Votes
max_users_votes = max(users_votes,na.rm = TRUE)
max_users_votes

## [1] 1689764

which.max(users_votes)

## [1] 1938

#Min User Votes
min_users_votes = min(users_votes,na.rm = TRUE)
min_users_votes

## [1] 5

which.min(users_votes)

## [1] 4703

#Range
max_users_votes - min_users_votes

## [1] 1689759

#Mean
mean_users_votes = mean(users_votes,na.rm = TRUE)
mean_users_votes

## [1] 83668.16

#Standard Deviation
sd_users_votes = sd(users_votes,na.rm = TRUE)
sd_users_votes

## [1] 138485.3

#Variance
var_users_votes = var(users_votes,na.rm = TRUE)
var_users_votes

## [1] 19178166353

The movie with the most user votes, The Shawshank Redemption, located in the 1,938th row, clocked in at 1.69 million votes, whereas the movie with the fewest user votes, located in the 4,703rd row, The Hadza: The Last of the First, recorded 5 user votes. The range for user votes is 1,689,759, the mean is 83,668.16, the standard deviation is 138,485.3, and the variance is 19,178,166,353.

Total Cast Likes

total_cast_likes = mydata$total_cast_likes
#Max Total Cast Likes
max_total_cast_likes = max(total_cast_likes,na.rm = TRUE)
max_total_cast_likes

## [1] 656730

which.max(total_cast_likes)

## [1] 1903

#Min Total Cast Likes
min_total_cast_likes = min(total_cast_likes,na.rm = TRUE)
min_total_cast_likes

## [1] 0

which.min(total_cast_likes)

## [1] 2242

#Range
max_total_cast_likes - min_total_cast_likes

## [1] 656730

#Mean
mean_total_cast_likes = mean(total_cast_likes,na.rm = TRUE)
mean_total_cast_likes

## [1] 9699.064

#Standard Deviation
sd_total_cast_likes = sd(total_cast_likes,na.rm = TRUE)
sd_total_cast_likes

## [1] 18163.8

#Variance
var_total_cast_likes = var(total_cast_likes,na.rm = TRUE)
var_total_cast_likes

## [1] 329923599

The movie with the most total cast likes, Anchorman: The Legend of Ron Burgundy, located in the 1,903rd row, clocked in at 656,730 likes, whereas the movie with the fewest total cast likes, located in the 2,242nd row, Yu-Gi-Oh! Duel Monsters, recorded 0 likes. The range for total cast likes is 656,730, the mean is 9,699.064, the standard deviation is 18,163.8, and the variance is 329,923,599.

Director Facebook Likes

director_fb_likes = mydata$director_fb_likes
#Max Director fb Likes
max_director_fb_likes = max(director_fb_likes,na.rm = TRUE)
max_director_fb_likes

## [1] 23000

which.max(director_fb_likes)

## [1] 3673

#Min Director fb Likes
min_director_fb_likes = min(director_fb_likes,na.rm = TRUE)
min_director_fb_likes

## [1] 0

which.min(director_fb_likes)

## [1] 1

#Range
max_director_fb_likes - min_director_fb_likes

## [1] 23000

#Mean
mean_director_fb_likes = mean(director_fb_likes,na.rm = TRUE)
mean_director_fb_likes

## [1] 686.5092

#Standard Deviation
sd_director_fb_likes = sd(director_fb_likes,na.rm = TRUE)
sd_director_fb_likes

## [1] 2813.329

#Variance
var_director_fb_likes = var(director_fb_likes,na.rm = TRUE)
var_director_fb_likes

## [1] 7914818

The movie with the most director Facebook likes, Don Jon, located in the 3,673rd row, clocked in at 23,000 likes, whereas the minimum is 0 likes. which.min reports the first row, Avatar, because it returns only the first position when several rows tie for the minimum. The range for director Facebook likes is 23,000, the mean is 686.5092, the standard deviation is 2,813.329, and the variance is 7,914,818.

Critic Reviews

critic_reviews = mydata$critic_reviews
#Max Critic Reviews
max_critic_reviews = max(critic_reviews,na.rm = TRUE)
max_critic_reviews

## [1] 813

which.max(critic_reviews)

## [1] 4

#Min Critic Reviews
min_critic_reviews = min(critic_reviews,na.rm = TRUE)
min_critic_reviews

## [1] 1

which.min(critic_reviews)

## [1] 99

#Range
max_critic_reviews - min_critic_reviews

## [1] 812

#Mean
mean_critic_reviews = mean(critic_reviews,na.rm = TRUE)
mean_critic_reviews

## [1] 140.1943

#Standard Deviation
sd_critic_reviews = sd(critic_reviews,na.rm = TRUE)
sd_critic_reviews

## [1] 121.6017

#Variance
var_critic_reviews = var(critic_reviews,na.rm = TRUE)
var_critic_reviews

## [1] 14786.97

The movie with the most critic reviews, The Dark Knight Rises, located in the fourth row, clocked in at 813 critic reviews, whereas the movie with the fewest critic reviews, located in the 99th row, Godzilla Resurgence, recorded 1 critic review. The range for critic reviews is 812, the mean is 140.1943, the standard deviation is 121.6017, and the variance is 14,786.97.

Task 2

An easy way to calculate all of these statistics of all of these variables is with the summary function. Below is an example.

summary(mydata)

##                       title                     genres    
##  Ben-Hur                 :   3   Drama               : 236  
##  Halloween               :   3   Comedy              : 209  
##  Home                    :   3   Comedy|Drama        : 191  
##  King Kong               :   3   Comedy|Drama|Romance: 187  
##  Pan                     :   3   Comedy|Romance      : 158  
##  The Fast and the Furious:   3   Drama|Romance       : 152  
##  (Other)                 :5025   (Other)             :3910  
##              director                  actor1                 actor2    
##                  : 104   Robert De Niro   :  49   Morgan Freeman :  20  
##  Steven Spielberg:  26   Johnny Depp      :  41   Charlize Theron:  15  
##  Woody Allen     :  22   Nicolas Cage     :  33   Brad Pitt      :  14  
##  Clint Eastwood  :  20   J.K. Simmons     :  31                  :  13  
##  Martin Scorsese :  20   Bruce Willis     :  30   James Franco   :  11  
##  Ridley Scott    :  17   Denzel Washington:  30   Meryl Streep   :  11  
##  (Other)         :4834   (Other)          :4829   (Other)        :4959  
##             actor3         length          budget         
##                :  23   Min.   :  7.0   Min.   :2.180e+02  
##  Ben Mendelsohn:   8   1st Qu.: 93.0   1st Qu.:6.000e+06  
##  John Heard    :   8   Median :103.0   Median :2.000e+07  
##  Steve Coogan  :   8   Mean   :107.2   Mean   :3.975e+07  
##  Anne Hathaway :   7   3rd Qu.:118.0   3rd Qu.:4.500e+07  
##  Jon Gries     :   7   Max.   :511.0   Max.   :1.222e+10  
##  (Other)       :4982   NA's   :15      NA's   :492        
##  director_fb_likes actor1_fb_likes  actor2_fb_likes  actor3_fb_likes  
##  Min.   :    0.0   Min.   :     0   Min.   :     0   Min.   :    0.0  
##  1st Qu.:    7.0   1st Qu.:   614   1st Qu.:   281   1st Qu.:  133.0  
##  Median :   49.0   Median :   988   Median :   595   Median :  371.5  
##  Mean   :  686.5   Mean   :  6560   Mean   :  1652   Mean   :  645.0  
##  3rd Qu.:  194.5   3rd Qu.: 11000   3rd Qu.:   918   3rd Qu.:  636.0  
##  Max.   :23000.0   Max.   :640000   Max.   :137000   Max.   :23000.0  
##  NA's   :104       NA's   :7        NA's   :13       NA's   :23       
##  total_cast_likes    fb_likes      critic_reviews  users_reviews   
##  Min.   :     0   Min.   :     0   Min.   :  1.0   Min.   :   1.0  
##  1st Qu.:  1411   1st Qu.:     0   1st Qu.: 50.0   1st Qu.:  65.0  
##  Median :  3090   Median :   166   Median :110.0   Median : 156.0  
##  Mean   :  9699   Mean   :  7526   Mean   :140.2   Mean   : 272.8  
##  3rd Qu.: 13756   3rd Qu.:  3000   3rd Qu.:195.0   3rd Qu.: 326.0  
##  Max.   :656730   Max.   :349000   Max.   :813.0   Max.   :5060.0  
##                                    NA's   :50      NA's   :21      
##   users_votes          score        aspect_ratio       gross          
##  Min.   :      5   Min.   :1.600   Min.   : 1.18   Min.   :      162  
##  1st Qu.:   8594   1st Qu.:5.800   1st Qu.: 1.85   1st Qu.:  5340988  
##  Median :  34359   Median :6.600   Median : 2.35   Median : 25517500  
##  Mean   :  83668   Mean   :6.442   Mean   : 2.22   Mean   : 48468408  
##  3rd Qu.:  96309   3rd Qu.:7.200   3rd Qu.: 2.35   3rd Qu.: 62309438  
##  Max.   :1689764   Max.   :9.500   Max.   :16.00   Max.   :760505847  
##                                    NA's   :329     NA's   :884        
##       year     
##  Min.   :1916  
##  1st Qu.:1999  
##  Median :2005  
##  Mean   :2002  
##  3rd Qu.:2011  
##  Max.   :2016  
##  NA's   :108

Thank you for teaching me a costly lesson in efficiency.

Some statistics, such as the standard deviation and variance, are not captured here, but summary is a quick and easy way to find most of the basic statistics.
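The statistics that summary leaves out can be filled in for all numeric columns at once with `sapply`. The sketch below uses a small made-up data frame called `toy` (an assumption for illustration, not the lab data) so that it runs on its own.

```r
# Made-up data frame standing in for mydata; it has one text column
# and two numeric columns, one of which contains a missing value.
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  gross = c(10, 20, NA, 40),
  score = c(6.5, 7.0, 8.0, 5.5)
)

# Keep only the numeric columns, then apply sd() and var() to each one
numeric_cols <- toy[sapply(toy, is.numeric)]
sd_by_col <- sapply(numeric_cols, sd, na.rm = TRUE)
var_by_col <- sapply(numeric_cols, var, na.rm = TRUE)

sd_by_col
var_by_col
```

Replacing `toy` with `mydata` should give the standard deviation and variance of every numeric variable in the lab data in two lines.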

Now, we will produce a basic plot of the ‘gross’ variable. Here we use the plot function and pass it the variable we want to plot.

plot(gross, main = "Gross Sales")

When looking at this graph we cannot truly capture the data or see a clear pattern. Labeling the axes and connecting the points helps a little; later in this task we will also re-order the data by increasing gross.

#xlab labels the x axis, ylab labels the y axis
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)")

There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.

Now, let's plot the gross graph alongside user votes, total cast likes, director Facebook likes, and critic reviews, which you will code. Make sure to run the code in the same chunk so the plots share the same layout.

#Layout allows us to see all 5 graphs on one screen (a 2 x 3 grid, filled by row)
layout(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))

#Example of how to plot the gross variable (values are in dollars)
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)") 

#Plot of User Votes
plot(users_votes, type = "b", xlab = "Case Number", ylab = "User Votes")

#Plot of Total Cast Likes
plot(total_cast_likes, type = "b", xlab = "Case Number", ylab = "Total Cast Likes")

#Plot of Director Facebook Likes
plot(director_fb_likes, type = "b", xlab = "Case Number", ylab = "Director Facebook Likes")

#Plot of Critic Reviews
plot(critic_reviews, type = "b", xlab = "Case Number", ylab = "Critic Reviews")
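Five plots need a grid with at least five cells, which a 2-by-2 layout cannot provide. The sketch below shows one way to build a 2-by-3 grid, assuming made-up series in place of the lab variables. It draws to a null PDF device so that it also runs without a screen; in RStudio the `pdf(NULL)` and `dev.off()` lines can be dropped.

```r
# Null device: nothing is written to disk, but plotting calls still work
pdf(NULL)

# A 2 x 3 grid filled row by row; cell 6 simply stays empty
grid <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
n_panels <- layout(grid)   # layout() returns the number of panels

# Five made-up series standing in for the five lab variables
for (i in 1:5) {
  plot(seq_len(10) * i, type = "b",
       xlab = "Case Number", ylab = paste("Toy series", i))
}

invisible(dev.off())
```

An equivalent option is `par(mfrow = c(2, 3))`, which also fills the panels row by row.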

The case numbers are simply row positions in the data set; they are in no particular order and are not related to a chronological sequence. Since each movie is an independent case, we can reorder them. To reveal a potential trend, reorder the rows by gross from low to high and see how the other four variables behave.

newdata = mydata[order(gross),]
newgross = newdata$gross
newusersvotes = newdata$users_votes
newtotalcastlikes = newdata$total_cast_likes
newdirectorfblikes = newdata$director_fb_likes
newcriticreviews = newdata$critic_reviews

#Layout allows us to see all 5 graphs on one screen (a 2 x 3 grid, filled by row)
layout(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))

#Example of how to plot the gross variable (values are in dollars)
plot(newgross, type="b", xlab = "Case Number", ylab = "Gross ($)") 

#Plot of User Votes
plot(newusersvotes, type = "b", xlab = "Case Number", ylab = "User Votes")

#Plot of Total Cast Likes
plot(newtotalcastlikes, type = "b", xlab = "Case Number", ylab = "Total Cast Likes")

#Plot of Director Facebook Likes
plot(newdirectorfblikes, type = "b", xlab = "Case Number", ylab = "Director Facebook Likes")

#Plot of Critic Reviews
plot(newcriticreviews, type = "b", xlab = "Case Number", ylab = "Critic Reviews")
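The reordering step relies on `order`, which returns row positions rather than sorted values, so every column stays aligned with its movie. The sketch below illustrates this on a made-up data frame (`toy` is an assumption, not the lab data) and shows that rows with a missing gross are placed last by default.

```r
# Made-up data: the second row has a missing gross value
toy <- data.frame(
  gross = c(300, NA, 100, 200),
  users_votes = c(30, 5, 12, 18)
)

# order() gives the row positions that sort gross from low to high;
# indexing the whole data frame keeps each row's columns together.
sorted <- toy[order(toy$gross), ]

sorted$gross        # increasing, with the NA row at the end
sorted$users_votes  # votes travel with their rows
```

Passing `na.last = NA` to `order` would drop the rows with a missing gross instead of keeping them at the end.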

#Scatter plot of user votes against gross, with a fitted regression line in red
plot(mydata$users_votes~gross, xlab = "Gross ($)", ylab = "User Votes")
abline(lm(mydata$users_votes~gross), col="red")
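The red line drawn by `abline` comes from a linear model, and the strength of the relationship can also be expressed as a number. The sketch below is a minimal illustration on made-up values (`toy` is hypothetical), using `cor` for the correlation and `coef` to read the slope of the fitted line.

```r
# Made-up values; the last row has a missing gross
toy <- data.frame(
  gross = c(100, 200, 300, 400, NA),
  users_votes = c(12, 18, 33, 41, 7)
)

# Correlation using only rows where both values are present
r <- cor(toy$gross, toy$users_votes, use = "complete.obs")

# lm() drops incomplete rows by default; the slope is the change in
# user votes associated with one extra unit of gross.
fit <- lm(users_votes ~ gross, data = toy)
slope <- coef(fit)[["gross"]]

r
slope
```

A correlation near 1 would suggest that votes rise with gross. It does not by itself show that one causes the other.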

Task 3

Given a gross value of 10214013, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.

We know that the z-score = (x - mean)/sd. So, input this into the R code where x=10214013, mean=48468408, and stdev = 68452990 which we found above.

x = 10214013
zscore = (x - mean_gross)/sd_gross
zscore

## [1] -0.5588418

Based on the z-score, a gross of $10,214,013 is below-average performance because the z-score is negative. The data point sits a little over one-half of a standard deviation below the mean for this metric, so its performance is considered under par, or poor.
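The z-score arithmetic can be checked on its own, without loading the data, by plugging in the rounded mean and standard deviation reported in Task 1. In the sketch below, `z_score` is a hypothetical helper name introduced only for illustration.

```r
# z_score() is a hypothetical helper: distance from the mean,
# measured in standard deviations.
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

# Rounded mean and standard deviation of gross from Task 1
z <- z_score(10214013, mu = 48468408, sigma = 68452990)
round(z, 4)

# The function is vectorized, so several gross values can be scored at once
z_many <- z_score(c(10214013, 48468408, 760505847),
                  mu = 48468408, sigma = 68452990)
z_many
```

A value equal to the mean scores exactly 0, and values above the mean score positive.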

CME Group Foundation Business Analytics Lab

Matthew Krishnan

26 July 2017
