About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this lab, we will explore the rottentomatoes data set and understand it better through simple statistics.

Setup

Download the ‘bsad_lab03’ zip file and extract it. Next, set the working directory: open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Then follow the directions below to complete the lab.


Task 1

First, read in the data from the ‘rottentomatoes.csv’ file and view the first rows to make sure it was read in correctly.

mydata = read.csv(file="bsad_lab03/rottentomatoes.csv")
head(mydata)
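
Before computing statistics, it can help to confirm the column types and count the missing values, since columns with missing values need na.rm = TRUE in the calls that follow. A minimal sketch on a small made-up data frame (the `toy` values are invented for illustration and are not taken from rottentomatoes.csv); the same calls should apply to mydata:

```r
# Toy stand-in for mydata; values are invented for illustration only
toy <- data.frame(
  gross = c(100, NA, 300, 250),
  users_votes = c(10, 20, 30, 40)
)

# str() prints each column's type and its first few values
str(toy)

# Count missing values per column; columns with NAs need na.rm = TRUE
na_counts <- colSums(is.na(toy))
na_counts
```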

Now calculate the range, min, max, mean, standard deviation, and variance for each variable. Below is an example of how to compute these for the variable ‘gross’. Follow the example and do the same for users_votes, total_cast_likes, director_fb_likes, and critic_reviews.

Gross

gross = mydata$gross
#Max Gross
maxgross = max(gross, na.rm = TRUE)
maxgross
[1] 760505847
#Min Gross
mingross = min(gross, na.rm = TRUE)
mingross
[1] 162
#Range
rangegross = maxgross-mingross
rangegross
[1] 760505685
#Mean
meangross = mean(gross, na.rm = TRUE)
meangross
[1] 48468408
#Standard Deviation
sdgross = sd(gross, na.rm = TRUE)
sdgross
[1] 68452990
#Variance
vargross = var(gross, na.rm = TRUE)
vargross
[1] 4.685812e+15

User_Votes

votes=mydata$users_votes
max = max(votes)
max
[1] 1689764
min = min(votes)
min
[1] 5
max-min
[1] 1689759
mean(votes)
[1] 83668.16
sd(votes)
[1] 138485.3
var(votes)
[1] 19178166353

total_cast_likes

castlikes=mydata$total_cast_likes
max = max(castlikes)
max
[1] 656730
min = min(castlikes)
min
[1] 0
max-min
[1] 656730
mean(castlikes)
[1] 9699.064
sd(castlikes)
[1] 18163.8
var(castlikes)
[1] 329923599

director_fb_likes

directorlikes=mydata$director_fb_likes
maxdl = max(directorlikes, na.rm = TRUE)
maxdl
[1] 23000
mindl = min(directorlikes, na.rm = TRUE)
mindl
[1] 0
rangedl = maxdl-mindl
rangedl
[1] 23000
meandl=mean(directorlikes, na.rm = TRUE)
meandl
[1] 686.5092
sddl=sd(directorlikes, na.rm = TRUE)
sddl
[1] 2813.329
vardl=var(directorlikes, na.rm = TRUE)
vardl
[1] 7914818

critic_reviews

criticreviews=mydata$critic_reviews
maxcr = max(criticreviews, na.rm = TRUE)
maxcr
[1] 813
mincr = min(criticreviews, na.rm = TRUE)
mincr
[1] 1
rangecr = maxcr-mincr
rangecr
[1] 812
meancr=mean(criticreviews, na.rm = TRUE)
meancr
[1] 140.1943
sdcr=sd(criticreviews, na.rm = TRUE)
sdcr
[1] 121.6017
varcr=var(criticreviews, na.rm = TRUE)
varcr
[1] 14786.97

Follow the ‘gross’ example and do the same for users_votes, total_cast_likes, director_fb_likes, and critic_reviews. Use the worksheet given to try the different commands to find the max, min, and range.
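
Repeating the same six commands for every variable invites copy-and-paste mistakes, such as reusing a max or min left over from an earlier variable. One way to avoid this is to wrap the six statistics in a small function. This is only a sketch: `describe_var` is a hypothetical helper name, not part of the lab, and the toy vector is invented for illustration.

```r
# Hypothetical helper (not part of the lab): six statistics for one numeric vector
describe_var <- function(x) {
  c(
    min      = min(x, na.rm = TRUE),
    max      = max(x, na.rm = TRUE),
    range    = max(x, na.rm = TRUE) - min(x, na.rm = TRUE),
    mean     = mean(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE)
  )
}

# Toy vector, invented for illustration; the NA is skipped by na.rm = TRUE
toy_votes <- c(5, 10, 15, NA, 30)
stats <- describe_var(toy_votes)
stats
```

Assuming mydata is loaded as above, a call such as sapply(mydata[c("gross", "users_votes")], describe_var) should return one column of statistics per variable.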


Task 2

An easy way to calculate many of these statistics for all of the variables at once is the summary function. Below is an example.

summary(mydata)
                        title                       genres                 director   
 Ben-Hur                  :   3   Drama               : 236                   : 104  
 Halloween                :   3   Comedy              : 209   Steven Spielberg:  26  
 Home                     :   3   Comedy|Drama        : 191   Woody Allen     :  22  
 King Kong                :   3   Comedy|Drama|Romance: 187   Clint Eastwood  :  20  
 Pan                      :   3   Comedy|Romance      : 158   Martin Scorsese :  20  
 The Fast and the Furious :   3   Drama|Romance       : 152   Ridley Scott    :  17  
 (Other)                   :5025   (Other)             :3910   (Other)         :4834  
               actor1                 actor2                actor3         length     
 Robert De Niro   :  49   Morgan Freeman :  20                 :  23   Min.   :  7.0  
 Johnny Depp      :  41   Charlize Theron:  15   Ben Mendelsohn:   8   1st Qu.: 93.0  
 Nicolas Cage     :  33   Brad Pitt      :  14   John Heard    :   8   Median :103.0  
 J.K. Simmons     :  31                  :  13   Steve Coogan  :   8   Mean   :107.2  
 Bruce Willis     :  30   James Franco   :  11   Anne Hathaway :   7   3rd Qu.:118.0  
 Denzel Washington:  30   Meryl Streep   :  11   Jon Gries     :   7   Max.   :511.0  
 (Other)          :4829   (Other)        :4959   (Other)       :4982   NA's   :15     
     budget          director_fb_likes actor1_fb_likes  actor2_fb_likes  actor3_fb_likes  
 Min.   :2.180e+02   Min.   :    0.0   Min.   :     0   Min.   :     0   Min.   :    0.0  
 1st Qu.:6.000e+06   1st Qu.:    7.0   1st Qu.:   614   1st Qu.:   281   1st Qu.:  133.0  
 Median :2.000e+07   Median :   49.0   Median :   988   Median :   595   Median :  371.5  
 Mean   :3.975e+07   Mean   :  686.5   Mean   :  6560   Mean   :  1652   Mean   :  645.0  
 3rd Qu.:4.500e+07   3rd Qu.:  194.5   3rd Qu.: 11000   3rd Qu.:   918   3rd Qu.:  636.0  
 Max.   :1.222e+10   Max.   :23000.0   Max.   :640000   Max.   :137000   Max.   :23000.0  
 NA's   :492         NA's   :104       NA's   :7        NA's   :13       NA's   :23       
 total_cast_likes    fb_likes      critic_reviews  users_reviews     users_votes          score      
 Min.   :     0   Min.   :     0   Min.   :  1.0   Min.   :   1.0   Min.   :      5   Min.   :1.600  
 1st Qu.:  1411   1st Qu.:     0   1st Qu.: 50.0   1st Qu.:  65.0   1st Qu.:   8594   1st Qu.:5.800  
 Median :  3090   Median :   166   Median :110.0   Median : 156.0   Median :  34359   Median :6.600  
 Mean   :  9699   Mean   :  7526   Mean   :140.2   Mean   : 272.8   Mean   :  83668   Mean   :6.442  
 3rd Qu.: 13756   3rd Qu.:  3000   3rd Qu.:195.0   3rd Qu.: 326.0   3rd Qu.:  96309   3rd Qu.:7.200  
 Max.   :656730   Max.   :349000   Max.   :813.0   Max.   :5060.0   Max.   :1689764   Max.   :9.500  
                                   NA's   :50      NA's   :21                                        
  aspect_ratio       gross                year     
 Min.   : 1.18   Min.   :      162   Min.   :1916  
 1st Qu.: 1.85   1st Qu.:  5340988   1st Qu.:1999  
 Median : 2.35   Median : 25517500   Median :2005  
 Mean   : 2.22   Mean   : 48468408   Mean   :2002  
 3rd Qu.: 2.35   3rd Qu.: 62309438   3rd Qu.:2011  
 Max.   :16.00   Max.   :760505847   Max.   :2016  
 NA's   :329     NA's   :884         NA's   :108   

Some statistics, such as standard deviation and variance, are not captured here, but this is an easy and quick way to find most of your basic statistics.
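
The standard deviation and variance that summary leaves out can be filled in for all numeric columns at once with sapply. A minimal sketch, assuming a data frame shaped like mydata; the `toy` data frame below is invented for illustration:

```r
# Toy data frame, invented for illustration; stands in for mydata
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  gross = c(100, 200, NA, 400),
  score = c(6.0, 7.0, 8.0, 9.0)
)

# Keep only numeric columns, since sd() and var() do not apply to text
num_cols <- toy[sapply(toy, is.numeric)]

# Standard deviation and variance per column, skipping missing values
sds  <- sapply(num_cols, sd,  na.rm = TRUE)
vars <- sapply(num_cols, var, na.rm = TRUE)
rbind(sd = sds, variance = vars)
```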

Now, we will produce a basic plot of the ‘gross’ variable, followed by the other four variables. Here we utilize the plot function, and within the plot function we call the variable we want to plot.

plot(gross)

plot(votes)

plot(castlikes)

plot(directorlikes)

plot(criticreviews)

When looking at these graphs, we cannot truly capture the data or see a clear pattern. A better way to visualize the data is to reorder it by increasing gross, which we do further below. First, we label the axes.

#xlab labels the x axis, ylab labels the y axis
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)")

There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
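
As a sketch of the customizations mentioned above, the standard plot arguments main, col, pch, and lwd set the heading, color, point symbol, and line width. The toy values and the particular color are assumptions chosen for illustration:

```r
# Toy values, invented for illustration; stands in for the gross vector
toy_gross <- c(120, 340, 90, 560, 410)

# main adds a heading, col sets the color,
# pch picks the point symbol, lwd sets the line width
plot(toy_gross, type = "b",
     main = "Gross by Case", xlab = "Case Number", ylab = "Gross",
     col = "steelblue", pch = 19, lwd = 2)
```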

Now, let's plot the gross graph alongside users_votes, total_cast_likes, director_fb_likes, and critic_reviews. Make sure to run the code in the same chunk so they are on the same layout.

#Layout allows us to see all 5 graphs on one screen
layout(matrix(1:5,1,5))
#Plot of Gross
plot(gross, type="b", xlab = "Case Number", ylab = "Gross")
#Plot of User Votes
plot(votes, type="b", xlab = "Case Number", ylab = "User Votes")
#Plot of Cast Likes
plot(castlikes, type="b", xlab = "Case Number", ylab = "Cast Likes")
#Plot of Director Likes
plot(directorlikes, type="b", xlab = "Case Number", ylab = "Director Likes")
#Plot of Critic Reviews
plot(criticreviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")

The case numbers are in no particular order and are not related to a chronological time sequence. Each row is simply a separate case. Since each case is independent, we can reorder them. To reveal a potential trend, consider reordering the rows by the gross column from low to high and see how the other four variables behave.

newdata = mydata[order(gross),]
newgross = newdata$gross
newvotes = newdata$users_votes
newcastlikes = newdata$total_cast_likes
newdirectorlikes = newdata$director_fb_likes
newcriticreviews = newdata$critic_reviews
#Layout allows us to see all 5 graphs on one screen
layout(matrix(1:5,1,5))
#Plot of Gross, now sorted from low to high
plot(newgross, type="b", xlab = "Case Number", ylab = "Gross")
#Plot of User Votes
plot(newvotes, type="b", xlab = "Case Number", ylab = "User Votes")
#Plot of Cast Likes
plot(newcastlikes, type="b", xlab = "Case Number", ylab = "Cast Likes")
#Plot of Director Likes
plot(newdirectorlikes, type="b", xlab = "Case Number", ylab = "Director Likes")
#Plot of Critic Reviews
plot(newcriticreviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")
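
The reordering step relies on order(), which returns row positions rather than sorted values. A minimal sketch on a made-up data frame (the `toy` values are invented for illustration) shows how indexing by those positions keeps every column aligned, and where missing values end up:

```r
# Toy data frame, invented for illustration
toy <- data.frame(
  gross = c(300, NA, 100, 200),
  votes = c(30, 40, 10, 20)
)

# order() returns the row positions that sort gross from low to high;
# rows with a missing gross are placed last by default
idx <- order(toy$gross)
idx

# Indexing the rows by idx reorders every column together
sorted <- toy[idx, ]
sorted$votes
```

Since the summary output reports 884 missing values for gross, those rows should sit at the tail of newdata and leave the right-hand end of the gross plot empty.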


Task 3

Given a gross value of 25000, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.

We know that the z-score = (x - mean)/sd. So, input this into the R code where x = 25000, mean = 48468408, and sd = 68452990, which we found above.

zgross = (25000-meangross)/sdgross
zgross
[1] -0.7076887

Histogram of Z-Scores

hist((gross-meangross)/sdgross)
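
The vector of z-scores passed to hist can be cross-checked against base R's scale() function, which centers and scales a column in one step. A minimal sketch, with toy values invented for illustration:

```r
# Toy values, invented for illustration; stands in for the gross vector
toy_gross <- c(10, 20, 30, 40, NA)

m <- mean(toy_gross, na.rm = TRUE)
s <- sd(toy_gross, na.rm = TRUE)

# Manual z-scores: (x - mean) / sd
z_manual <- (toy_gross - m) / s

# scale() skips NAs when computing the mean and sd;
# as.numeric() drops the matrix shape and attributes it returns
z_scaled <- as.numeric(scale(toy_gross))

all.equal(z_manual, z_scaled)
```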

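Based on the z-values, how would you rate a $25000 gross value: poor, average, good, or very good performance? Explain your logic.

Based on the z-value of -0.7076887 for x=25000, the gross value is poor. The z-score is negative, meaning 25000 lies about 0.71 standard deviations below the mean gross of 48468408. It is also close to the minimum gross of 162, whose z-score is about -0.708, so 25000 sits near the very bottom of the data.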

