Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.
In this lab, we will explore the movie data set (‘rottentomatoes.csv’) and understand it better through simple statistics.
Make sure to download the ‘bsad_lab03’ zip folder and extract it. Next, we must set the extracted folder as the working directory. To do this, open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, we begin by reading in the data from the ‘rottentomatoes.csv’ file and viewing the first rows to make sure it was read in correctly.
mydata = read.csv(file="data/rottentomatoes.csv")
head(mydata)
## ï..title
## 1 Avatar
## 2 Pirates of the Caribbean: At World's End
## 3 Spectre
## 4 The Dark Knight Rises
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter
## genres director actor1
## 1 Action|Adventure|Fantasy|Sci-Fi James Cameron CCH Pounder
## 2 Action|Adventure|Fantasy Gore Verbinski Johnny Depp
## 3 Action|Adventure|Thriller Sam Mendes Christoph Waltz
## 4 Action|Thriller Christopher Nolan Tom Hardy
## 5 Documentary Doug Walker Doug Walker
## 6 Action|Adventure|Sci-Fi Andrew Stanton Daryl Sabara
## actor2 actor3 length budget director_fb_likes
## 1 Joel David Moore Wes Studi 178 237000000 0
## 2 Orlando Bloom Jack Davenport 169 300000000 563
## 3 Rory Kinnear Stephanie Sigman 148 245000000 0
## 4 Christian Bale Joseph Gordon-Levitt 164 250000000 22000
## 5 Rob Walker NA NA 131
## 6 Samantha Morton Polly Walker 132 263700000 475
## actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1 1000 936 855 4834
## 2 40000 5000 1000 48350
## 3 11000 393 161 11700
## 4 27000 23000 23000 106759
## 5 131 12 NA 143
## 6 640 632 530 1873
## fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1 33000 723 3054 886204 7.9 1.78
## 2 0 302 1238 471220 7.1 2.35
## 3 85000 602 994 275868 6.8 2.35
## 4 164000 813 2701 1144337 8.5 2.35
## 5 0 NA NA 8 7.1 NA
## 6 24000 462 738 212204 6.6 2.35
## gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5 NA NA
## 6 73058679 2012
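If the first column name comes back with stray leading characters, as `ï..title` does in the output above, the file probably starts with a byte order mark (BOM). The following is a minimal sketch of one way to handle that, assuming the file is UTF-8 encoded. The temporary file and the names `tmp` and `demo` are illustrative only and are not part of the lab.

```r
# Build a tiny CSV that starts with a UTF-8 byte order mark (illustrative file, not the lab data)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                       # the three BOM bytes
writeBin(charToRaw("title,gross\nAvatar,760505847\n"), con)      # header plus one row
close(con)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

For the lab file, the equivalent call would be `read.csv(file="data/rottentomatoes.csv", fileEncoding = "UTF-8-BOM")`, assuming the file really is UTF-8 with a BOM. Note that this would also change the column name used in later code from `ï..title` to `title`.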
Now calculate the range, minimum, maximum, mean, standard deviation, and variance for each variable. Below is an example of how to compute these for the variable ‘gross’. Follow the example and do the same for user votes, total cast likes, director Facebook likes, and critic reviews.
Gross
gross = mydata$gross
#Max Gross
max_gross = max(gross,na.rm = TRUE)
max_gross
## [1] 760505847
which.max(gross)
## [1] 1
#Min Gross
min_gross = min(gross,na.rm = TRUE)
min_gross
## [1] 162
which.min(gross)
## [1] 3331
#Range
max_gross - min_gross
## [1] 760505685
#Mean
mean_gross = mean(gross,na.rm = TRUE)
mean_gross
## [1] 48468408
#Standard Deviation
sd_gross = sd(gross,na.rm = TRUE)
sd_gross
## [1] 68452990
#Variance
var_gross = var(gross,na.rm = TRUE)
var_gross
## [1] 4.685812e+15
Here, the maximum gross among all of the movies is 760.51 million dollars, located in the first row, which is Avatar. The movie that earned the least is Skin Trade, located in the 3,331st row, grossing just 162 dollars.
User Votes
users_votes = mydata$users_votes
#Max User Votes
max_users_votes = max(users_votes,na.rm = TRUE)
max_users_votes
## [1] 1689764
which.max(users_votes)
## [1] 1938
#Min User Votes
min_users_votes = min(users_votes,na.rm = TRUE)
min_users_votes
## [1] 5
which.min(users_votes)
## [1] 4703
#Range
max_users_votes - min_users_votes
## [1] 1689759
#Mean
mean_users_votes = mean(users_votes,na.rm = TRUE)
mean_users_votes
## [1] 83668.16
#Standard Deviation
sd_users_votes = sd(users_votes,na.rm = TRUE)
sd_users_votes
## [1] 138485.3
#Variance
var_users_votes = var(users_votes,na.rm = TRUE)
var_users_votes
## [1] 19178166353
The movie with the most user votes, The Shawshank Redemption, located in the 1,938th row, clocked in at 1.69 million votes, whereas the movie with the fewest user votes, The Hadza: The Last of the First, located in the 4,703rd row, recorded 5 user votes. The range for user votes is 1,689,759, the mean is 83,668.16, the standard deviation is 138,485.3, and the variance is 19,178,166,353.
Total Cast Likes
total_cast_likes = mydata$total_cast_likes
#Max Total Cast Likes
max_total_cast_likes = max(total_cast_likes,na.rm = TRUE)
max_total_cast_likes
## [1] 656730
which.max(total_cast_likes)
## [1] 1903
#Min Total Cast Likes
min_total_cast_likes = min(total_cast_likes,na.rm = TRUE)
min_total_cast_likes
## [1] 0
which.min(total_cast_likes)
## [1] 2242
#Range
max_total_cast_likes - min_total_cast_likes
## [1] 656730
#Mean
mean_total_cast_likes = mean(total_cast_likes,na.rm = TRUE)
mean_total_cast_likes
## [1] 9699.064
#Standard Deviation
sd_total_cast_likes = sd(total_cast_likes,na.rm = TRUE)
sd_total_cast_likes
## [1] 18163.8
#Variance
var_total_cast_likes = var(total_cast_likes,na.rm = TRUE)
var_total_cast_likes
## [1] 329923599
The movie with the most total cast likes, Anchorman: The Legend of Ron Burgundy, located in the 1,903rd row, clocked in at 656,730 likes, whereas the movie with the fewest total cast likes, Yu-Gi-Oh! Duel Monsters, located in the 2,242nd row, recorded 0 likes. The range for total cast likes is 656,730, the mean is 9,699.064, the standard deviation is 18,163.8, and the variance is 329,923,599.
Director Facebook Likes
director_fb_likes = mydata$director_fb_likes
#Max Director fb Likes
max_director_fb_likes = max(director_fb_likes,na.rm = TRUE)
max_director_fb_likes
## [1] 23000
which.max(director_fb_likes)
## [1] 3673
#Min Director fb Likes
min_director_fb_likes = min(director_fb_likes,na.rm = TRUE)
min_director_fb_likes
## [1] 0
which.min(director_fb_likes)
## [1] 1
#Range
max_director_fb_likes - min_director_fb_likes
## [1] 23000
#Mean
mean_director_fb_likes = mean(director_fb_likes,na.rm = TRUE)
mean_director_fb_likes
## [1] 686.5092
#Standard Deviation
sd_director_fb_likes = sd(director_fb_likes,na.rm = TRUE)
sd_director_fb_likes
## [1] 2813.329
#Variance
var_director_fb_likes = var(director_fb_likes,na.rm = TRUE)
var_director_fb_likes
## [1] 7914818
The movie with the most director Facebook likes, Don Jon, located in the 3,673rd row, clocked in at 23,000 likes. The minimum is 0 likes; which.min reports only the first row that reaches the minimum, which here is Avatar in the first row. The range for director Facebook likes is 23,000, the mean is 686.5092, the standard deviation is 2,813.329, and the variance is 7,914,818.
Critic Reviews
critic_reviews = mydata$critic_reviews
#Max Critic Reviews
max_critic_reviews = max(critic_reviews,na.rm = TRUE)
max_critic_reviews
## [1] 813
which.max(critic_reviews)
## [1] 4
#Min Critic Reviews
min_critic_reviews = min(critic_reviews,na.rm = TRUE)
min_critic_reviews
## [1] 1
which.min(critic_reviews)
## [1] 99
#Range
max_critic_reviews - min_critic_reviews
## [1] 812
#Mean
mean_critic_reviews = mean(critic_reviews,na.rm = TRUE)
mean_critic_reviews
## [1] 140.1943
#Standard Deviation
sd_critic_reviews = sd(critic_reviews,na.rm = TRUE)
sd_critic_reviews
## [1] 121.6017
#Variance
var_critic_reviews = var(critic_reviews,na.rm = TRUE)
var_critic_reviews
## [1] 14786.97
The movie with the most critic reviews, The Dark Knight Rises, located in the fourth row, clocked in at 813 critic reviews, whereas the movie with the fewest critic reviews, Godzilla Resurgence, located in the 99th row, recorded 1 critic review. The range for critic reviews is 812, the mean is 140.1943, the standard deviation is 121.6017, and the variance is 14,786.97.
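The five sections above repeat the same six computations, so they could be wrapped in a small function. The sketch below is one possible way to do that; `describe_var` is a hypothetical helper name, not something the lab provides, and the toy input is just the six critic_reviews values printed by head(mydata).

```r
# Hypothetical helper (not part of the lab): six descriptive statistics for one numeric vector
describe_var <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  c(min   = lo,
    max   = hi,
    range = hi - lo,
    mean  = mean(x, na.rm = TRUE),
    sd    = sd(x, na.rm = TRUE),
    var   = var(x, na.rm = TRUE))
}

# Toy input: the critic_reviews values from the first six rows, including the NA
critic_head <- c(723, 302, 602, 813, NA, 462)
stats <- describe_var(critic_head)
stats
```

With the lab data loaded, `describe_var(mydata$gross)` should reproduce the values computed step by step in the Gross section.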
An easy way to calculate most of these statistics for all of the variables at once is the summary function. Below is an example.
summary(mydata)
## ï..title genres
## Ben-Hur : 3 Drama : 236
## Halloween : 3 Comedy : 209
## Home : 3 Comedy|Drama : 191
## King Kong : 3 Comedy|Drama|Romance: 187
## Pan : 3 Comedy|Romance : 158
## The Fast and the Furious : 3 Drama|Romance : 152
## (Other) :5025 (Other) :3910
## director actor1 actor2
## : 104 Robert De Niro : 49 Morgan Freeman : 20
## Steven Spielberg: 26 Johnny Depp : 41 Charlize Theron: 15
## Woody Allen : 22 Nicolas Cage : 33 Brad Pitt : 14
## Clint Eastwood : 20 J.K. Simmons : 31 : 13
## Martin Scorsese : 20 Bruce Willis : 30 James Franco : 11
## Ridley Scott : 17 Denzel Washington: 30 Meryl Streep : 11
## (Other) :4834 (Other) :4829 (Other) :4959
## actor3 length budget
## : 23 Min. : 7.0 Min. :2.180e+02
## Ben Mendelsohn: 8 1st Qu.: 93.0 1st Qu.:6.000e+06
## John Heard : 8 Median :103.0 Median :2.000e+07
## Steve Coogan : 8 Mean :107.2 Mean :3.975e+07
## Anne Hathaway : 7 3rd Qu.:118.0 3rd Qu.:4.500e+07
## Jon Gries : 7 Max. :511.0 Max. :1.222e+10
## (Other) :4982 NA's :15 NA's :492
## director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 7.0 1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0
## Median : 49.0 Median : 988 Median : 595 Median : 371.5
## Mean : 686.5 Mean : 6560 Mean : 1652 Mean : 645.0
## 3rd Qu.: 194.5 3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0
## Max. :23000.0 Max. :640000 Max. :137000 Max. :23000.0
## NA's :104 NA's :7 NA's :13 NA's :23
## total_cast_likes fb_likes critic_reviews users_reviews
## Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0
## Median : 3090 Median : 166 Median :110.0 Median : 156.0
## Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8
## 3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0
## Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0
## NA's :50 NA's :21
## users_votes score aspect_ratio gross
## Min. : 5 Min. :1.600 Min. : 1.18 Min. : 162
## 1st Qu.: 8594 1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 5340988
## Median : 34359 Median :6.600 Median : 2.35 Median : 25517500
## Mean : 83668 Mean :6.442 Mean : 2.22 Mean : 48468408
## 3rd Qu.: 96309 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 62309438
## Max. :1689764 Max. :9.500 Max. :16.00 Max. :760505847
## NA's :329 NA's :884
## year
## Min. :1916
## 1st Qu.:1999
## Median :2005
## Mean :2002
## 3rd Qu.:2011
## Max. :2016
## NA's :108
Thank you for teaching me a costly lesson in efficiency.
Some statistics, such as standard deviation and variance, are not captured here, but summary is an easy and quick way to find most of your basic statistics.
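To fill the gap that summary leaves, the standard deviation and variance can be computed for every numeric column in one pass. This is a minimal sketch under the assumption that a base R approach with sapply is acceptable; the data frame `demo` is a toy stand-in built from a few columns of the first six rows shown by head(mydata).

```r
# Toy data frame from the first six rows printed earlier; with the lab data, use mydata instead
demo <- data.frame(
  director = c("James Cameron", "Gore Verbinski", "Sam Mendes",
               "Christopher Nolan", "Doug Walker", "Andrew Stanton"),
  gross = c(760505847, 309404152, 200074175, 448130642, NA, 73058679),
  critic_reviews = c(723, 302, 602, 813, NA, 462),
  score = c(7.9, 7.1, 6.8, 8.5, 7.1, 6.6)
)

# Keep only the numeric columns, then apply sd() and var() column by column
numeric_cols <- demo[sapply(demo, is.numeric)]
spread <- sapply(numeric_cols, function(col) {
  c(sd = sd(col, na.rm = TRUE), var = var(col, na.rm = TRUE))
})
spread
```

The result is a matrix with one column per numeric variable and the rows `sd` and `var`, which can be read side by side with the summary output.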
Now, we will produce a basic plot of the ‘gross’ variable. Here we use the plot function and pass it the variable we want to plot.
plot(gross, main = "Gross Sales")
When looking at this graph we cannot truly capture the data or see a clear pattern. A better way to visualize the data would be to re-order it by increasing gross, which we do later in this lab. For now, we connect the points and label the axes.
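#type = "b" draws both points and lines; xlab labels the x axis, ylab labels the y axis
#gross is recorded in dollars, so the y axis is labeled in dollars
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)")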
There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
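As a rough illustration of these options, the sketch below sorts a handful of gross values and plots them with a colour, a heading, and a thicker line. The vector `gross_demo` is a toy stand-in taken from the rows shown by head(mydata), and the colour and labels are arbitrary choices.

```r
# Avoid writing Rplots.pdf when this is run as a script rather than interactively
if (!interactive()) pdf(NULL)

# Toy values: the gross figures from the first six rows, NA included
gross_demo <- c(760505847, 309404152, 200074175, 448130642, NA, 73058679)

# sort() drops NA by default and returns the values in increasing order
gross_sorted <- sort(gross_demo)

# col sets the colour, lwd the line width, main the heading
plot(gross_sorted / 1e6, type = "b", col = "blue", lwd = 2,
     main = "Gross Sales (sorted)",
     xlab = "Rank (lowest to highest)", ylab = "Gross in $ millions")
```

Dividing by 1e6 before plotting keeps the axis labels short while making the unit explicit in `ylab`.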
Now, let's plot the gross graph alongside user votes, total cast likes, director Facebook likes, and critic reviews, which you will code. Make sure to run the code in the same chunk so they are on the same layout.
#Layout allows us to see all 5 graphs on one screen (2 rows x 3 columns; the sixth cell stays empty)
layout(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))
#Example of how to plot the gross variable
plot(gross, type="b", xlab = "Case Number", ylab = "Gross ($)")
#Plot of User Votes
plot(users_votes, type = "b", xlab = "Case Number", ylab = "User Votes")
#Plot of Total Cast Likes
plot(total_cast_likes, type = "b", xlab = "Case Number", ylab = "Total Cast Likes")
#Plot of Director Facebook Likes
plot(director_fb_likes, type = "b", xlab = "Case Number", ylab = "Director Facebook Likes")
#Plot of Critic Reviews
plot(critic_reviews, type = "b", xlab = "Case Number", ylab = "Critic Reviews")
The case numbers are simply row positions in the data set. They are in no particular order and are not related to a chronological sequence; each row is an independent movie. Since each case is independent, we can reorder them. To reveal a potential trend, consider reordering the gross column from low to high and see how the other four variables behave.
newdata = mydata[order(gross),]
newgross = newdata$gross
newusersvotes = newdata$users_votes
newtotalcastlikes = newdata$total_cast_likes
newdirectorfblikes = newdata$director_fb_likes
newcriticreviews = newdata$critic_reviews
#Layout allows us to see all 5 graphs on one screen (2 rows x 3 columns; the sixth cell stays empty)
layout(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE))
#Example of how to plot the gross variable
plot(newgross, type="b", xlab = "Case Number", ylab = "Gross ($)")
#Plot of User Votes
plot(newusersvotes, type = "b", xlab = "Case Number", ylab = "User Votes")
#Plot of Total Cast Likes
plot(newtotalcastlikes, type = "b", xlab = "Case Number", ylab = "Total Cast Likes")
#Plot of Director Facebook Likes
plot(newdirectorfblikes, type = "b", xlab = "Case Number", ylab = "Director Facebook Likes")
#Plot of Critic Reviews
plot(newcriticreviews, type = "b", xlab = "Case Number", ylab = "Critic Reviews")
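The reordering step relies on order(), which returns row positions rather than sorted values. The sketch below shows the idea on toy vectors taken from the first six rows printed by head(mydata); the names `gross_demo`, `votes_demo`, and `idx` are illustrative only.

```r
# Toy vectors: gross and users_votes from the first six rows
gross_demo <- c(760505847, 309404152, 200074175, 448130642, NA, 73058679)
votes_demo <- c(886204, 471220, 275868, 1144337, 8, 212204)

# order() returns the positions that would sort gross_demo from low to high;
# by default, positions holding NA are placed last
idx <- order(gross_demo)
idx

# Indexing a second vector by idx keeps each movie's values paired together
votes_by_gross <- votes_demo[idx]
votes_by_gross
```

This is why `mydata[order(gross),]` reorders every column consistently: the same row positions are applied to the whole data frame, and rows with a missing gross end up at the bottom.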
#Scatter plot of user votes against gross (both in raw units)
plot(mydata$users_votes~gross, xlab = "Gross ($)", ylab = "User Votes")
#Add the fitted linear regression line in red
abline(lm(mydata$users_votes~gross), col="red")
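The trend suggested by the red line can also be put into numbers. The following sketch, which is an addition and not part of the original lab, computes a Pearson correlation and the fitted intercept and slope on toy vectors taken from the first six rows shown by head(mydata). With only a few rows the values are purely illustrative.

```r
# Toy vectors: gross and users_votes from the first six rows
gross_demo <- c(760505847, 309404152, 200074175, 448130642, NA, 73058679)
votes_demo <- c(886204, 471220, 275868, 1144337, 8, 212204)

# Pearson correlation, using only the rows where both values are present
r <- cor(gross_demo, votes_demo, use = "complete.obs")
r

# lm() drops rows containing NA by default; coef() returns the intercept and slope
fit <- lm(votes_demo ~ gross_demo)
coef(fit)
```

On the full data, the same calls with `gross` and `mydata$users_votes` would describe the line drawn by abline.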
Given a gross value of 10214013, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.
We know that the z-score = (x - mean)/sd. So, input this into the R code where x=10214013, mean=48468408, and stdev = 68452990 which we found above.
x = 10214013
zscore = (x - mean_gross)/sd_gross
zscore
## [1] -0.5588418
Based on the z-score, a gross of $10,214,013 is below-average performance because its z-score is negative. The data point lies a little over one-half of a standard deviation below the mean for this metric, so its performance is considered under par, or poor.
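The same calculation can be applied to several values at once. The sketch below uses a hypothetical helper, `z_score`, together with the mean and standard deviation of gross as printed earlier (so the results are only as precise as those rounded figures). The second and third inputs are the mean itself and the maximum gross, added for comparison.

```r
# Mean and standard deviation of gross, as printed earlier in the lab
mean_gross <- 48468408
sd_gross <- 68452990

# Hypothetical helper: how many standard deviations a value lies from the mean
z_score <- function(x, mean_x, sd_x) (x - mean_x) / sd_x

# Arithmetic in R is vectorised, so several gross values can be scored in one call
z <- z_score(c(10214013, 48468408, 760505847), mean_gross, sd_gross)
round(z, 2)
```

A value equal to the mean scores exactly 0, values below the mean score negative, and the maximum gross lies many standard deviations above the mean.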