Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.
In this lab, we will explore the movie data set in ‘rottentomatoes.csv’ and understand it better through simple statistics.
Make sure to download the zip folder titled ‘bsad_lab03’ and extract it. Next, we must set this folder as the working directory. To do this, open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, begin by reading in the data from the ‘rottentomatoes.csv’ file and viewing it to make sure it is read in correctly.
mydata = read.csv(file="bsad_lab03/rottentomatoes.csv")
head(mydata)
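Beyond head(), a few extra checks can confirm that a file was read in as expected. The following is a minimal sketch only: the toy data frame, its values, and the temporary file are hypothetical stand-ins for the lab's CSV, used so the example runs on its own.

```r
# Hypothetical toy data standing in for the lab's CSV file.
toy <- data.frame(
  title = c("Film A", "Film B", "Film C"),
  gross = c(1200, NA, 560),
  users_votes = c(10, 25, 7)
)

# Write the toy data to a temporary file, then read it back.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

stopifnot(file.exists(toy_path))  # fail early if the path is wrong
toy_read <- read.csv(file = toy_path)

dim(toy_read)             # number of rows and columns
str(toy_read)             # column names and types
colSums(is.na(toy_read))  # missing values per column
```

Checking the count of missing values per column at this stage shows which variables will need na.rm = TRUE later on.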
Now calculate the Range, Min, Max, Mean, STDEV, and Variance for each variable. Below is an example of how to compute the items for the variable ‘gross’. Follow the example and do the same for users_votes, total_cast_likes, director_fb_likes, and critic_reviews.
Gross
gross = mydata$gross
#Max Gross
maxgross = max(gross, na.rm = TRUE)
maxgross
[1] 760505847
#Min Gross
mingross = min(gross, na.rm = TRUE)
mingross
[1] 162
#Range
rangegross = maxgross-mingross
rangegross
[1] 760505685
#Mean
meangross = mean(gross, na.rm = TRUE)
meangross
[1] 48468408
#Standard Deviation
sdgross = sd(gross, na.rm = TRUE)
sdgross
[1] 68452990
#Variance
vargross = var(gross, na.rm = TRUE)
vargross
[1] 4.685812e+15
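The na.rm = TRUE argument matters because these functions return NA as soon as one value is missing. A small sketch with made-up numbers (the vector x is hypothetical, not taken from the data set):

```r
x <- c(162, 5000, NA, 25000)  # hypothetical values; one is missing

mean(x)                # NA: a single missing value propagates
mean(x, na.rm = TRUE)  # mean of the three observed values only
sum(is.na(x))          # how many values were left out
```

The same argument is accepted by min(), max(), sd(), and var().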
User_Votes
votes=mydata$users_votes
max = max(votes)
max
[1] 1689764
min = min(votes)
min
[1] 5
max-min
[1] 1689759
mean(votes)
[1] 83668.16
sd(votes)
[1] 138485.3
var(votes)
[1] 19178166353
total_cast_likes
castlikes=mydata$total_cast_likes
max = max(castlikes)
max
[1] 656730
min = min(castlikes)
min
[1] 0
max-min
[1] 656730
mean(castlikes)
[1] 9699.064
sd(castlikes)
[1] 18163.8
var(castlikes)
[1] 329923599
director_fb_likes
directorlikes=mydata$director_fb_likes
maxdl = max(directorlikes, na.rm = TRUE)
maxdl
[1] 23000
mindl = min(directorlikes, na.rm = TRUE)
mindl
[1] 0
rangedl = maxdl-mindl
rangedl
[1] 23000
meandl=mean(directorlikes, na.rm = TRUE)
meandl
[1] 686.5092
sddl=sd(directorlikes, na.rm = TRUE)
sddl
[1] 2813.329
vardl=var(directorlikes, na.rm = TRUE)
vardl
[1] 7914818
critic_reviews
criticreviews=mydata$critic_reviews
maxcr = max(criticreviews, na.rm = TRUE)
maxcr
[1] 813
mincr = min(criticreviews, na.rm = TRUE)
mincr
[1] 1
rangecr = maxcr-mincr
rangecr
[1] 812
meancr=mean(criticreviews, na.rm = TRUE)
meancr
[1] 140.1943
sdcr=sd(criticreviews, na.rm = TRUE)
sdcr
[1] 121.6017
varcr=var(criticreviews, na.rm = TRUE)
varcr
[1] 14786.97
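Since the same six statistics are computed for every variable, the steps can be wrapped in one function. This is only a sketch: describe_var is a hypothetical helper name and toy_reviews holds made-up values. Computing the range inside the function from its own minimum and maximum means it cannot pick up a value left over from a different variable.

```r
# Hypothetical helper: the six statistics for one numeric vector.
describe_var <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  c(min = lo, max = hi, range = hi - lo,
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    var = var(x, na.rm = TRUE))
}

toy_reviews <- c(1, 50, NA, 110, 813)  # made-up values
describe_var(toy_reviews)
```

The result is a named numeric vector, so a single statistic can be pulled out with, for example, describe_var(toy_reviews)[["range"]].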
Follow the example and do the same for the other four variables. Use the worksheet given to try the different commands to find the max, min, and range.
An easy way to calculate all of these statistics of all of these variables is with the summary function. Below is an example.
summary(mydata)
                      title                       genres                director
 Ben-Hur                 :   3   Drama               : 236                   : 104
 Halloween               :   3   Comedy              : 209   Steven Spielberg:  26
 Home                    :   3   Comedy|Drama        : 191   Woody Allen     :  22
 King Kong               :   3   Comedy|Drama|Romance: 187   Clint Eastwood  :  20
 Pan                     :   3   Comedy|Romance      : 158   Martin Scorsese :  20
 The Fast and the Furious:   3   Drama|Romance       : 152   Ridley Scott    :  17
 (Other)                 :5025   (Other)             :3910   (Other)         :4834

               actor1                  actor2                 actor3         length
 Robert De Niro   :  49   Morgan Freeman :  20                 :  23   Min.   :  7.0
 Johnny Depp      :  41   Charlize Theron:  15   Ben Mendelsohn:   8   1st Qu.: 93.0
 Nicolas Cage     :  33   Brad Pitt      :  14   John Heard    :   8   Median :103.0
 J.K. Simmons     :  31                  :  13   Steve Coogan  :   8   Mean   :107.2
 Bruce Willis     :  30   James Franco   :  11   Anne Hathaway :   7   3rd Qu.:118.0
 Denzel Washington:  30   Meryl Streep   :  11   Jon Gries     :   7   Max.   :511.0
 (Other)          :4829   (Other)        :4959   (Other)       :4982   NA's   :15

     budget          director_fb_likes  actor1_fb_likes  actor2_fb_likes  actor3_fb_likes
 Min.   :2.180e+02   Min.   :    0.0    Min.   :     0   Min.   :     0   Min.   :    0.0
 1st Qu.:6.000e+06   1st Qu.:    7.0    1st Qu.:   614   1st Qu.:   281   1st Qu.:  133.0
 Median :2.000e+07   Median :   49.0    Median :   988   Median :   595   Median :  371.5
 Mean   :3.975e+07   Mean   :  686.5    Mean   :  6560   Mean   :  1652   Mean   :  645.0
 3rd Qu.:4.500e+07   3rd Qu.:  194.5    3rd Qu.: 11000   3rd Qu.:   918   3rd Qu.:  636.0
 Max.   :1.222e+10   Max.   :23000.0    Max.   :640000   Max.   :137000   Max.   :23000.0
 NA's   :492         NA's   :104        NA's   :7        NA's   :13       NA's   :23

 total_cast_likes     fb_likes       critic_reviews  users_reviews      users_votes          score
 Min.   :     0   Min.   :     0   Min.   :  1.0   Min.   :   1.0   Min.   :      5   Min.   :1.600
 1st Qu.:  1411   1st Qu.:     0   1st Qu.: 50.0   1st Qu.:  65.0   1st Qu.:   8594   1st Qu.:5.800
 Median :  3090   Median :   166   Median :110.0   Median : 156.0   Median :  34359   Median :6.600
 Mean   :  9699   Mean   :  7526   Mean   :140.2   Mean   : 272.8   Mean   :  83668   Mean   :6.442
 3rd Qu.: 13756   3rd Qu.:  3000   3rd Qu.:195.0   3rd Qu.: 326.0   3rd Qu.:  96309   3rd Qu.:7.200
 Max.   :656730   Max.   :349000   Max.   :813.0   Max.   :5060.0   Max.   :1689764   Max.   :9.500
                                   NA's   :50      NA's   :21

  aspect_ratio         gross                year
 Min.   : 1.18   Min.   :      162   Min.   :1916
 1st Qu.: 1.85   1st Qu.:  5340988   1st Qu.:1999
 Median : 2.35   Median : 25517500   Median :2005
 Mean   : 2.22   Mean   : 48468408   Mean   :2002
 3rd Qu.: 2.35   3rd Qu.: 62309438   3rd Qu.:2011
 Max.   :16.00   Max.   :760505847   Max.   :2016
 NA's   :329     NA's   :884         NA's   :108
Some statistics, such as standard deviation and variance, are not captured by summary, but it is a quick and easy way to find most of your basic statistics.
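The standard deviation and variance can be computed for every numeric column in one pass with sapply. This is a sketch under assumptions: the toy data frame and its values are hypothetical, and the same pattern would apply to any data frame with numeric columns.

```r
# Hypothetical toy data frame with one text column and two numeric ones.
toy <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  gross = c(162, 5000, NA, 25000),
  users_votes = c(5, 40, 12, 300)
)

# Keep only the numeric columns, then compute sd and var for each.
num_cols <- toy[sapply(toy, is.numeric)]
spread <- sapply(num_cols, function(col) {
  c(sd = sd(col, na.rm = TRUE), var = var(col, na.rm = TRUE))
})
spread
```

The result is a small matrix with one column per variable and one row per statistic.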
Now, we will produce a basic plot of the ‘gross’ variable and of each of the other four variables. Here we utilize the plot function, and within the plot function we call the variable we want to plot.
plot(gross)
plot(votes)
plot(castlikes)
plot(directorlikes)
plot(criticreviews)
When looking at these graphs we cannot truly capture the data or see a clear pattern. A better way to visualize the data would be to re-order it based on increasing gross.
#xlab labels the x axis, ylab labels the y axis
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)")
There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
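As a sketch of the simpler customizations, the base plot function accepts arguments for colour, line width, point symbol, and a heading. The vector toy_gross holds made-up values, and the colour and title are arbitrary choices; interactive plots need additional packages and are not shown here.

```r
toy_gross <- c(162, 5000, 12000, 25000, 48000)  # made-up values

plot(toy_gross,
     type = "b",            # points joined by lines
     col = "steelblue",     # colour of the lines and points
     lwd = 2,               # line width
     pch = 19,              # solid circles
     main = "Gross by case",
     xlab = "Case Number",
     ylab = "Gross")
```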
Now, let's plot the gross graph alongside user votes, cast likes, director likes, and critic reviews. Make sure to run the code in the same chunk so they are on the same layout.
#Layout allows us to see all 5 graphs on one screen
layout(matrix(1:5,1,5))
#Plot of gross
plot(gross, type="b", xlab = "Case Number", ylab = "Gross")
#Plot of user votes
plot(votes, type="b", xlab = "Case Number", ylab = "User Votes")
#Plot of cast likes
plot(castlikes, type="b", xlab = "Case Number", ylab = "Cast Likes")
#Plot of director likes
plot(directorlikes, type="b", xlab = "Case Number", ylab = "Director Likes")
#Plot of critic reviews
plot(criticreviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")
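Five panels in a single row can be narrow. One alternative, sketched here with a hypothetical toy data frame of random numbers, is a 2-by-3 grid; the sixth cell is simply left empty, and layout(1) restores one plot per page afterwards.

```r
set.seed(1)
# Hypothetical data: five columns of random values, ten cases each.
toy <- data.frame(a = runif(10), b = runif(10), c = runif(10),
                  d = runif(10), e = runif(10))

# Panels are filled row by row: 1 2 3 on top, 4 5 6 below.
grid <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
layout(grid)
for (name in names(toy)) {
  plot(toy[[name]], type = "b", xlab = "Case Number", ylab = name)
}
layout(1)  # back to a single plot per page
```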
The cases are in no particular order and are not related to a chronological time sequence. Each case is simply an independent film. Since each case is independent, we can reorder them. To reveal a potential trend, consider reordering the gross column from low to high and see how the other four variables behave.
newdata = mydata[order(gross),]
newgross = newdata$gross
newvotes = newdata$users_votes
newcastlikes = newdata$total_cast_likes
newdirectorlikes = newdata$director_fb_likes
newcriticreviews = newdata$critic_reviews
#Layout allows us to see all 5 graphs on one screen
layout(matrix(1:5,1,5))
#Plot of gross, now sorted from low to high
plot(newgross, type="b", xlab = "Case Number", ylab = "Gross")
#Plot of user votes
plot(newvotes, type="b", xlab = "Case Number", ylab = "User Votes")
#Plot of cast likes
plot(newcastlikes, type="b", xlab = "Case Number", ylab = "Cast Likes")
#Plot of director likes
plot(newdirectorlikes, type="b", xlab = "Case Number", ylab = "Director Likes")
#Plot of critic reviews
plot(newcriticreviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")
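The reordering step relies on order(), which returns row positions rather than sorted values. Indexing the whole data frame with those positions keeps each row intact, so the other variables stay matched to their gross value. A small sketch with a hypothetical three-row data frame:

```r
# Hypothetical toy data: three cases, deliberately out of order.
toy <- data.frame(
  gross = c(25000, 162, 12000),
  users_votes = c(300, 5, 40)
)

idx <- order(toy$gross)  # row positions from lowest to highest gross
idx

sorted <- toy[idx, ]     # every column is reordered together
sorted
```

By default order() places missing values last, so cases with no gross value end up at the end of the reordered data.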
Given a gross value of 25000, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.
We know that the z-score = (x - mean)/sd. So, input this into the R code where x = 25000, mean = 48468408 (meangross), and sd = 68452990 (sdgross), which we found above.
zgross = (25000-meangross)/sdgross
zgross
[1] -0.7076887
hist((gross-meangross)/sdgross)
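The histogram above standardizes every value with the same formula. As a sketch with made-up numbers (toy_gross is hypothetical), the manual calculation can be checked against R's built-in scale() function, which centers a vector on its mean and divides by its standard deviation.

```r
toy_gross <- c(162, 5000, 12000, 25000, 48000)  # made-up values

m <- mean(toy_gross)
s <- sd(toy_gross)

z_manual <- (toy_gross - m) / s          # z-score formula applied to each value
z_scale <- as.numeric(scale(toy_gross))  # same result from scale()
all.equal(z_manual, z_scale)

mean(z_manual)  # z-scores are centered on 0 (up to rounding error)
sd(z_manual)    # and have a standard deviation of 1
```

A value equal to the mean has a z-score of 0, negative z-scores lie below the mean, and positive ones lie above it.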
$25000 gross value: poor, average, good, or very good performance? Explain your logic.
Based on the z-value of -0.7076887 for x=25000, the gross value is poor. A z-score of 0 corresponds to the mean, so this value lies about 0.71 standard deviations below the average gross.