Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.
In this lab, we will explore the Rotten Tomatoes movie data set and understand it better through simple statistics.
Make sure to download the ‘bsad_lab03’ zip folder and extract it. Next, set the extracted folder as the working directory: open RStudio, go to ‘Session’, select ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
Begin by reading in the data from the ‘rottentomatoes.csv’ file in the ‘data’ folder and viewing the first rows to make sure it was read in correctly.
mydata = read.csv(file="data/rottentomatoes.csv")
head(mydata)
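Beyond head, a few quick checks can confirm that a file was read in as expected. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (the column names and rows are assumptions, not the lab data) and inspects it with dim, str, and a count of missing values per column.

```r
# Write a tiny illustrative CSV to a temporary file (made-up rows, not the lab data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("title,budget,score",
             "Movie A,1000000,6.5",
             "Movie B,,7.1"), tmp)

demo <- read.csv(file = tmp, stringsAsFactors = FALSE)

dim(demo)               # number of rows and columns
str(demo)               # column names and types
na_counts <- colSums(is.na(demo))  # missing values per column
na_counts
```

The same three calls can be run on mydata once it has been loaded.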
title = mydata$title
title = sub("Â","",title)
head(title)
[1] "Avatar "
[2] "Pirates of the Caribbean: At World's End "
[3] "Spectre "
[4] "The Dark Knight Rises "
[5] "Star Wars: Episode VII - The Force Awakens "
[6] "John Carter "
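Note that sub replaces only the first match in each string, and the titles above still end with a trailing character. A possible clean-up, assuming that trailing character is a non-breaking space left over from the file's encoding, is sketched below on made-up strings; gsub replaces every match and trimws removes leading and trailing whitespace.

```r
# Illustrative titles with the stray character and a trailing non-breaking space
# (assumed form of the raw values; check your own data before relying on this)
raw_titles <- c("Avatar\u00c2\u00a0", "Spectre\u00c2\u00a0", "John Carter\u00c2\u00a0")

clean <- gsub("\u00c2", "", raw_titles, fixed = TRUE)  # drop every stray character
clean <- gsub("\u00a0", " ", clean, fixed = TRUE)      # non-breaking space -> plain space
clean <- trimws(clean)                                 # strip leading/trailing whitespace
clean
```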
Now calculate the Range, Min, Max, Mean, STDEV, and Variance for each variable. Below is an example of how to compute these for the variable ‘users_votes’. The same steps are then repeated for total_cast_likes, director_fb_likes, gross, and budget.
UsersVotes = mydata$users_votes
head(UsersVotes)
[1] 886204 471220 275868 1144337 8 212204
#Max
maxUsersVotes = max(UsersVotes)
maxUsersVotes
[1] 1689764
#Min
minUsersVotes = min(UsersVotes)
minUsersVotes
[1] 5
#Mean
averageUsersVotes = mean(UsersVotes)
averageUsersVotes
[1] 83668.16
#Range
rangeUsersVotes = maxUsersVotes - minUsersVotes
rangeUsersVotes
[1] 1689759
#Standard Deviation
sdUsersVotes = sd(UsersVotes)
sdUsersVotes
[1] 138485.3
#Variance
varUsersVotes = var(UsersVotes)
varUsersVotes
[1] 19178166353
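The six steps above follow the same pattern for every variable, so they could be wrapped in one small function. The helper below, describe_var, is a hypothetical name introduced only for illustration and is not part of the lab files; the example values are made up.

```r
# Hypothetical helper (not part of the lab files): collects the six
# statistics computed step by step above into one named vector.
describe_var <- function(x) {
  x <- x[!is.na(x)]   # drop missing values before computing anything
  c(min = min(x), max = max(x), mean = mean(x),
    range = max(x) - min(x), sd = sd(x), var = var(x))
}

# Made-up values used only to illustrate the call
example_values <- c(2, 4, 4, 4, 5, 5, 7, 9)
stats <- describe_var(example_values)
stats
```

Calling describe_var(mydata$users_votes) should then return the same six numbers as the individual steps.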
Total Cast Likes
TotalCastLikes = mydata$total_cast_likes
head(TotalCastLikes)
[1] 4834 48350 11700 106759 143 1873
#Max
maxTotalCastLikes = max(TotalCastLikes)
maxTotalCastLikes
[1] 656730
#Min
minTotalCastLikes = min(TotalCastLikes)
minTotalCastLikes
[1] 0
#Mean
averageTotalCastLikes = mean(TotalCastLikes)
averageTotalCastLikes
[1] 9699.064
#Range
rangeTotalCastLikes = maxTotalCastLikes - minTotalCastLikes
rangeTotalCastLikes
[1] 656730
#Standard Deviation
sdTotalCastLikes = sd(TotalCastLikes)
sdTotalCastLikes
[1] 18163.8
#Variance
varTotalCastLikes = var(TotalCastLikes)
varTotalCastLikes
[1] 329923599
DirectorFacebookLikes = mydata$director_fb_likes
DirectorFacebookLikes = DirectorFacebookLikes[!is.na(DirectorFacebookLikes)]
head(DirectorFacebookLikes)
[1] 0 563 0 22000 131 475
#Max
maxDirectorFacebookLikes = max(DirectorFacebookLikes)
maxDirectorFacebookLikes
[1] 23000
#Min
minDirectorFacebookLikes = min(DirectorFacebookLikes)
minDirectorFacebookLikes
[1] 0
#Mean
averageDirectorFacebookLikes = mean(DirectorFacebookLikes)
averageDirectorFacebookLikes
[1] 686.5092
#Range
rangeDirectorFacebookLikes = maxDirectorFacebookLikes - minDirectorFacebookLikes
rangeDirectorFacebookLikes
[1] 23000
#Standard Deviation
sdDirectorFacebookLikes = sd(DirectorFacebookLikes)
sdDirectorFacebookLikes
#Variance
varDirectorFacebookLikes = var(DirectorFacebookLikes)
varDirectorFacebookLikes
gross = mydata$gross
gross = gross[!is.na(gross)]
head(gross)
#Max
maxGross = max(gross)
maxGross
#Min
minGross = min(gross)
minGross
#Mean
averageGross = mean(gross)
averageGross
#Range
rangeGross = maxGross - minGross
rangeGross
#Standard Deviation
sdGross = sd(gross)
sdGross
#Variance
varGross = var(gross)
varGross
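Filtering out NA values before each calculation, as done above, is one option. Most of the base R summary functions also accept na.rm = TRUE, which skips missing values without creating a filtered copy. A minimal sketch on a made-up vector:

```r
# Made-up vector with one missing value
likes <- c(0, 563, NA, 22000)

mean(likes)                 # NA, because one value is missing
mean(likes, na.rm = TRUE)   # ignores the NA
sum(is.na(likes))           # how many values were missing

# Same result as filtering first, as the lab code does
filtered <- likes[!is.na(likes)]
same <- mean(filtered) == mean(likes, na.rm = TRUE)
same
```

max, min, sd, and var take the same na.rm argument.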
budget = mydata$budget
budget = budget[!is.na(budget)]
head(budget)
[1] 237000000 300000000 245000000 250000000 263700000 258000000
#Max
maxbudget = max(budget)
maxbudget
[1] 12215500000
#Min
minbudget = min(budget)
minbudget
[1] 218
#Mean
averagebudget = mean(budget)
averagebudget
[1] 39752620
#Range
rangebudget = maxbudget - minbudget
rangebudget
[1] 12215499782
#Standard Deviation
sdbudget = sd(budget)
sdbudget
[1] 206114898
#Variance
varbudget = var(budget)
varbudget
[1] 4.248335e+16
Follow the same pattern for any other numeric variable in the data set. Use the worksheet given to try the different commands to find the max, min, and range.
An easy way to calculate most of these statistics for every variable at once is the summary function. Below is an example.
summary(mydata)
title genres director
Ben-Hur : 3 Drama : 236 : 104
Halloween : 3 Comedy : 209 Steven Spielberg: 26
Home : 3 Comedy|Drama : 191 Woody Allen : 22
King Kong : 3 Comedy|Drama|Romance: 187 Clint Eastwood : 20
Pan : 3 Comedy|Romance : 158 Martin Scorsese : 20
The Fast and the Furious : 3 Drama|Romance : 152 Ridley Scott : 17
(Other) :5025 (Other) :3910 (Other) :4834
actor1 actor2 actor3 length
Robert De Niro : 49 Morgan Freeman : 20 : 23 Min. : 7.0
Johnny Depp : 41 Charlize Theron: 15 Ben Mendelsohn: 8 1st Qu.: 93.0
Nicolas Cage : 33 Brad Pitt : 14 John Heard : 8 Median :103.0
J.K. Simmons : 31 : 13 Steve Coogan : 8 Mean :107.2
Bruce Willis : 30 James Franco : 11 Anne Hathaway : 7 3rd Qu.:118.0
Denzel Washington: 30 Meryl Streep : 11 Jon Gries : 7 Max. :511.0
(Other) :4829 (Other) :4959 (Other) :4982 NA's :15
budget director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
Min. :2.180e+02 Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.:6.000e+06 1st Qu.: 7.0 1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0
Median :2.000e+07 Median : 49.0 Median : 988 Median : 595 Median : 371.5
Mean :3.975e+07 Mean : 686.5 Mean : 6560 Mean : 1652 Mean : 645.0
3rd Qu.:4.500e+07 3rd Qu.: 194.5 3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0
Max. :1.222e+10 Max. :23000.0 Max. :640000 Max. :137000 Max. :23000.0
NA's :492 NA's :104 NA's :7 NA's :13 NA's :23
total_cast_likes fb_likes critic_reviews users_reviews users_votes
Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0 Min. : 5
1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0 1st Qu.: 8594
Median : 3090 Median : 166 Median :110.0 Median : 156.0 Median : 34359
Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8 Mean : 83668
3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0 3rd Qu.: 96309
Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0 Max. :1689764
NA's :50 NA's :21
score aspect_ratio gross year
Min. :1.600 Min. : 1.18 Min. : 162 Min. :1916
1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 5340988 1st Qu.:1999
Median :6.600 Median : 2.35 Median : 25517500 Median :2005
Mean :6.442 Mean : 2.22 Mean : 48468408 Mean :2002
3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 62309438 3rd Qu.:2011
Max. :9.500 Max. :16.00 Max. :760505847 Max. :2016
NA's :329 NA's :884 NA's :108
Some statistics, such as standard deviation and variance, are not captured here, but summary is a quick and easy way to find most of your basic statistics.
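One way to fill that gap is to apply sd and var to every numeric column with sapply. The sketch below uses a small made-up data frame (its columns are assumptions for illustration only); the same calls could be pointed at mydata.

```r
# Made-up data frame standing in for the lab data
demo <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(10, 20, NA, 40),
  score  = c(6.5, 7.0, 5.5, 8.0),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then compute sd and variance for each
numeric_cols <- demo[sapply(demo, is.numeric)]
sds  <- sapply(numeric_cols, sd,  na.rm = TRUE)
vars <- sapply(numeric_cols, var, na.rm = TRUE)

sds
vars
```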
Now, we will produce a basic plot of the ‘budget’ variable. Here we use the plot function and pass it the variable we want to plot.
plot(budget)
When looking at this graph we cannot truly capture the data or see a clear pattern. A better way to visualize this data would be to re-order it by increasing budget.
#Budget is reordered from least to most expensive
leastexpensiveBudget = sort(budget)
#xlab labels the x axis, ylab labels the y axis
plot(leastexpensiveBudget, type="b", xlab = "Movie", ylab = "Movie Budget", col.main = "Red")
There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
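As a rough illustration of such customizations, the sketch below sets a line color, a main title, and a logarithmic y axis, which can help when a few very large values dominate the plot. The budget values are made up, and the plot is written to a PDF in the temporary directory so the example also runs without a display. Note that col.main colors the main title, so it only has a visible effect when a title is supplied.

```r
# Made-up budgets, used only to illustrate the plotting options
demo_budget <- c(5e5, 2e6, 2e7, 4.5e7, 2.5e8)

# Save to a PDF in the temporary directory so the sketch runs without a display
out_file <- file.path(tempdir(), "budget_plot.pdf")
pdf(out_file)
plot(sort(demo_budget), type = "b", log = "y",
     col = "blue", main = "Movie Budget (log scale)", col.main = "red",
     xlab = "Movie", ylab = "Movie Budget")
dev.off()
```

In RStudio the pdf and dev.off calls can be dropped so the plot appears in the Plots pane.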
Now, let's plot the budget graph alongside director Facebook likes, total cast likes, and users votes. Make sure to run the code in the same chunk so they are on the same layout.
layout(matrix(1:4,2,2))
#Director Facebook Likes
ascendingDirectorFacebookLikes = sort(DirectorFacebookLikes)
plot(ascendingDirectorFacebookLikes, type="b", xlab = "Movie", ylab = "Director Facebook Likes")
title ("Director Facebook Likes by Movie")
#Total Cast Likes
ascendingTotalCastLikes = sort(TotalCastLikes)
plot(ascendingTotalCastLikes, type="b", xlab = "Movie", ylab = "Total Cast Likes")
title ("Total Cast Likes by Movie")
#Users Votes
ascendingUsersVotes = sort(UsersVotes)
plot(ascendingUsersVotes, type="b", xlab = "Movie", ylab = "Users Votes")
title ("Users Votes by Movie")
#Budget
ascendingBudget = sort(budget)
plot(ascendingBudget, type="b", xlab = "Movie", ylab = "Budget")
title ("Budget")
Given a value of 10234001, calculate the corresponding z-value or z-score for each variable using the mean and standard deviation calculated in task 1.
We know that the z-score = (x - mean)/sd. So, input this into the R code with x = 10234001 and the mean and standard deviation found above for each variable.
Based on the z-values, how would you rate a value of 10234001 for each variable: poor, average, good, or very good performance? Explain your logic.
#Budget
budgetZscore = (10234001 - averagebudget) / sdbudget
budgetZscore
[1] -0.1432144
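The arithmetic behind this result can be checked by plugging the printed budget mean and standard deviation into the formula directly. The function name z_score is a hypothetical helper introduced here for illustration; the inputs are the rounded values printed earlier, so the result agrees only to the printed precision.

```r
# Hypothetical helper; the numbers below are the budget mean and standard
# deviation printed earlier in this lab (rounded as displayed)
z_score <- function(x, mean_value, sd_value) (x - mean_value) / sd_value

z <- z_score(10234001, 39752620, 206114898)
z   # about -0.143, i.e. roughly 0.14 standard deviations below the mean
```

A z-score this close to zero means the value sits slightly below the mean, well within one standard deviation.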
#Total Cast Likes
TotalCastLikesZscore = (10234001 - averageTotalCastLikes) / sdTotalCastLikes
TotalCastLikesZscore
[1] 562.8945
#Director Facebook Likes
#Needs sdDirectorFacebookLikes; run the Standard Deviation step above first, otherwise R stops with "object 'sdDirectorFacebookLikes' not found"
DirectorFacebookLikesZscore = (10234001 - averageDirectorFacebookLikes) / sdDirectorFacebookLikes
#Total Cast Likes
totalcastlikesZscore = (10234001 - averageTotalCastLikes) / sdTotalCastLikes
totalcastlikesZscore
[1] 562.8945
# Budget
zscoreBudget = (budget - averagebudget) / sdbudget
# Total Cast Likes
zscoreTotalCastLikes = (TotalCastLikes - averageTotalCastLikes) / sdTotalCastLikes
# Director Facebook Likes
# Needs sdDirectorFacebookLikes; run the Standard Deviation step above first, otherwise R stops with "object 'sdDirectorFacebookLikes' not found"
zscoreDirectorFacebookLikes = (DirectorFacebookLikes - averageDirectorFacebookLikes) / sdDirectorFacebookLikes
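Base R also provides scale, which centers a vector on its mean and divides by its standard deviation in one call. The sketch below, on made-up values, checks that it matches the manual formula used above; scale returns a one-column matrix, so as.numeric is used to get a plain vector back.

```r
# Made-up values used only for illustration
values <- c(10, 20, 30, 40, 50)

# Manual z-scores, following the (x - mean) / sd formula
z_manual <- (values - mean(values)) / sd(values)

# scale() does the same centering and scaling; it returns a matrix
z_scaled <- as.numeric(scale(values))

all.equal(z_manual, z_scaled)
```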
#Histogram Graphs of Zscores
layout(matrix(1:4, 2,2))
hist(zscoreBudget)
#zscoreGross was never defined above, so compute it before plotting
zscoreGross = (gross - averageGross) / sdGross
hist(zscoreGross)
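Z-scores can also be used to flag unusually large or small observations. A common rule of thumb, used here as an assumption rather than something the lab prescribes, treats values more than 3 standard deviations from the mean as outliers. The sketch uses made-up budgets (in millions) with one extreme entry.

```r
# Made-up budgets in millions: nineteen typical values and one extreme one
demo_budget <- c(rep(10, 19), 1000)

z <- (demo_budget - mean(demo_budget)) / sd(demo_budget)

# Rule-of-thumb threshold (an assumption): |z| > 3 marks an outlier
outlier_index <- which(abs(z) > 3)
outlier_index
demo_budget[outlier_index]
```

Applied to the budget variable, a check like this would show whether the very large maximum is driving the shape of the histogram.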