Begin by reading in the data from the ‘rottentomatoes.csv’ file and viewing the first few rows to make sure it was read in correctly.
mydata = read.csv(file="data/rottentomatoes.csv")
head(mydata)
Now calculate the Range, Min, Max, Mean, STDEV, and Variance for each variable. Below is an example of how to compute the items for the variable ‘gross’.
gross
gross = mydata$gross
#Max Gross
max_gross = max(gross, na.rm = TRUE)
max_gross
[1] 760505847
#Min Gross
min_gross = min(gross, na.rm = TRUE)
min_gross
[1] 162
#Range
range_gross = max_gross-min_gross
range_gross
[1] 760505685
#Mean
mean_gross = mean(gross, na.rm = TRUE)
mean_gross
[1] 48468408
#Standard Deviation
sd_gross = sd(gross, na.rm = TRUE)
sd_gross
[1] 68452990
#Variance
var_gross = var(gross, na.rm = TRUE)
var_gross
[1] 4.685812e+15
User Votes, Total Cast Likes, Director Facebook Likes, Critic Reviews
#Calling
userVotes = mydata$users_votes
TCL = mydata$total_cast_likes
DFL = mydata$director_fb_likes
criticReviews = mydata$critic_reviews
#Max
max_userVotes = max(userVotes, na.rm = TRUE)
max_TCL = max(TCL, na.rm = TRUE)
max_DFL = max(DFL, na.rm = TRUE)
max_criticReviews = max(criticReviews, na.rm = TRUE)
max_userVotes
[1] 1689764
max_TCL
[1] 656730
max_DFL
[1] 23000
max_criticReviews
[1] 813
#Min
min_userVotes = min(userVotes, na.rm = TRUE)
min_TCL = min(TCL, na.rm = TRUE)
min_DFL = min(DFL, na.rm = TRUE)
min_criticReviews = min(criticReviews, na.rm = TRUE)
min_userVotes
[1] 5
min_TCL
[1] 0
min_DFL
[1] 0
min_criticReviews
[1] 1
#Range
range_userVotes = max_userVotes - min_userVotes
range_TCL = max_TCL - min_TCL
range_DFL = max_DFL - min_DFL
range_criticReviews = max_criticReviews - min_criticReviews
range_userVotes
[1] 1689759
range_TCL
[1] 656730
range_DFL
[1] 23000
range_criticReviews
[1] 812
#Mean
mean_userVotes = mean(userVotes, na.rm = TRUE)
mean_TCL = mean(TCL, na.rm = TRUE)
mean_DFL = mean(DFL, na.rm = TRUE)
mean_criticReviews = mean(criticReviews, na.rm = TRUE)
mean_userVotes
[1] 83668.16
mean_TCL
[1] 9699.064
mean_DFL
[1] 686.5092
mean_criticReviews
[1] 140.1943
#Standard Deviation
sd_userVotes = sd(userVotes, na.rm = TRUE)
sd_TCL = sd(TCL, na.rm = TRUE)
sd_DFL = sd(DFL, na.rm = TRUE)
sd_criticReviews = sd(criticReviews, na.rm = TRUE)
sd_userVotes
[1] 138485.3
sd_TCL
[1] 18163.8
sd_DFL
[1] 2813.329
sd_criticReviews
[1] 121.6017
#Variance
var_userVotes = var(userVotes, na.rm = TRUE)
var_TCL = var(TCL, na.rm = TRUE)
var_DFL = var(DFL, na.rm = TRUE)
var_criticReviews = var(criticReviews, na.rm = TRUE)
var_userVotes
[1] 19178166353
var_TCL
[1] 329923599
var_DFL
[1] 7914818
var_criticReviews
[1] 14786.97
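The repeated steps above can be wrapped in a small helper. This is only a minimal sketch: the function name `describe_var` and the toy vector are illustrative assumptions and are not part of the original data or code.

```r
# Hypothetical helper: returns the six statistics for one numeric vector.
describe_var <- function(x) {
  mx <- max(x, na.rm = TRUE)
  mn <- min(x, na.rm = TRUE)
  c(max      = mx,
    min      = mn,
    range    = mx - mn,
    mean     = mean(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE))
}

# Toy vector (made up for illustration), including one missing value
toy <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
toy_stats <- describe_var(toy)
toy_stats
```

With the real data, a call such as `describe_var(mydata$gross)` should reproduce the values computed by hand above.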
A quick way to get most of these statistics for every variable at once is the summary function. Below is an example.
summary(mydata)
title genres director actor1 actor2 actor3 length budget director_fb_likes
Ben-Hur : 3 Drama : 236 : 104 Robert De Niro : 49 Morgan Freeman : 20 : 23 Min. : 7.0 Min. :2.180e+02 Min. : 0.0
Halloween : 3 Comedy : 209 Steven Spielberg: 26 Johnny Depp : 41 Charlize Theron: 15 Ben Mendelsohn: 8 1st Qu.: 93.0 1st Qu.:6.000e+06 1st Qu.: 7.0
Home : 3 Comedy|Drama : 191 Woody Allen : 22 Nicolas Cage : 33 Brad Pitt : 14 John Heard : 8 Median :103.0 Median :2.000e+07 Median : 49.0
King Kong : 3 Comedy|Drama|Romance: 187 Clint Eastwood : 20 J.K. Simmons : 31 : 13 Steve Coogan : 8 Mean :107.2 Mean :3.975e+07 Mean : 686.5
Pan : 3 Comedy|Romance : 158 Martin Scorsese : 20 Bruce Willis : 30 James Franco : 11 Anne Hathaway : 7 3rd Qu.:118.0 3rd Qu.:4.500e+07 3rd Qu.: 194.5
The Fast and the Furious : 3 Drama|Romance : 152 Ridley Scott : 17 Denzel Washington: 30 Meryl Streep : 11 Jon Gries : 7 Max. :511.0 Max. :1.222e+10 Max. :23000.0
(Other) :5025 (Other) :3910 (Other) :4834 (Other) :4829 (Other) :4959 (Other) :4982 NA's :15 NA's :492 NA's :104
actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes fb_likes critic_reviews users_reviews users_votes score aspect_ratio gross year
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0 Min. : 5 Min. :1.600 Min. : 1.18 Min. : 162 Min. :1916
1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0 1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0 1st Qu.: 8594 1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 5340988 1st Qu.:1999
Median : 988 Median : 595 Median : 371.5 Median : 3090 Median : 166 Median :110.0 Median : 156.0 Median : 34359 Median :6.600 Median : 2.35 Median : 25517500 Median :2005
Mean : 6560 Mean : 1652 Mean : 645.0 Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8 Mean : 83668 Mean :6.442 Mean : 2.22 Mean : 48468408 Mean :2002
3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0 3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0 3rd Qu.: 96309 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 62309438 3rd Qu.:2011
Max. :640000 Max. :137000 Max. :23000.0 Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0 Max. :1689764 Max. :9.500 Max. :16.00 Max. :760505847 Max. :2016
NA's :7 NA's :13 NA's :23 NA's :50 NA's :21 NA's :329 NA's :884 NA's :108
Some statistics, such as standard deviation and variance, are not captured here, but this is an easy and quick way to find most of your basic statistics.
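Standard deviation and variance can be filled in for every numeric column in one pass. The following is a minimal sketch that uses a small made-up data frame in place of `mydata`; the column values are illustrative only.

```r
# Toy data frame standing in for mydata (values are made up)
toy_df <- data.frame(
  title          = c("A", "B", "C", "D"),
  gross          = c(100, 200, NA, 400),
  critic_reviews = c(10, 20, 30, 40)
)

# Keep only the numeric columns, then apply sd() and var() to each one
numeric_cols <- toy_df[sapply(toy_df, is.numeric)]
sd_all  <- sapply(numeric_cols, sd,  na.rm = TRUE)
var_all <- sapply(numeric_cols, var, na.rm = TRUE)

sd_all
var_all
```

The same two `sapply` calls should work on `mydata` itself, since the non-numeric columns (title, genres, and so on) are filtered out first.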
Now, we will produce a basic plot of the ‘gross’ variable. Here we use the plot function and pass it the variable we want to plot.
plot(gross)
When looking at this graph we cannot truly capture the data or see a clear pattern. A better way to visualize it is to reorder the data by increasing gross, which we do further below. First, connect the points and label the axes.
#xlab labels the x axis, ylab labels the y axis
plot(gross, type="b", xlab = "Case Number", ylab = "Gross in $")
There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
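As a minimal sketch of two such customizations, the `main` argument adds a heading and `col` sets the color of the lines and points. The toy vector is made up for illustration, and the plot is written to a temporary PDF only so that the sketch also runs without a screen device.

```r
# Toy values standing in for the gross column (made up)
toy_gross <- c(5, 12, 3, 20, 8)

# Draw to a temporary PDF file; in an interactive session the
# pdf() and dev.off() calls can simply be left out
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(toy_gross, type = "b", col = "blue",
     main = "Gross by Case Number",
     xlab = "Case Number", ylab = "Gross")
dev.off()
```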
Now, let's plot the gross graph alongside user votes, total cast likes, director Facebook likes, and critic reviews. Make sure to run the code in the same chunk so they are on the same layout.
#Layout allows us to see all 5 graphs on one screen (2 rows by 3 columns; 0 leaves the last cell empty)
layout(matrix(c(1:5, 0), nrow = 2, ncol = 3, byrow = TRUE))
#Example of how to plot the gross variable
plot(gross, type="b", xlab = "Case Number", ylab = "Gross in $")
#Plot of User Votes
plot(userVotes, type="b", xlab = "Case Number", ylab = "User Votes")
#Plot of Total Cast Likes
plot(TCL, type="b", xlab = "Case Number", ylab = "Total Cast Likes")
#Plot of Director Facebook Likes
plot(DFL, type="b", xlab = "Case Number", ylab = "Director Facebook Likes")
#Plot of Critic Reviews
plot(criticReviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")
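The matrix passed to layout assigns each plot a cell on a grid. A minimal sketch of a 2-by-3 grid that holds five plots follows; the variable names `cells` and `n_panels` are illustrative, and the temporary PDF device is there only so the sketch runs without a screen.

```r
# 2 rows x 3 columns; the numbers give the drawing order, 0 leaves a cell empty
cells <- matrix(c(1, 2, 3,
                  4, 5, 0), nrow = 2, ncol = 3, byrow = TRUE)
cells

pdf(tempfile(fileext = ".pdf"))  # temporary device for the sketch
n_panels <- layout(cells)        # layout() returns the number of figures
dev.off()
n_panels
```

Because the matrix has six cells and the values fill it exactly, this form does not produce a warning about the data length not matching the number of columns.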
The case numbers are simply row positions in the file. They are in no particular order and are not related to a chronological time sequence; each row is an independent movie. Since each case is independent, we can reorder them. To reveal a potential trend, consider reordering the rows by gross from low to high and see how the other four variables behave.
newdata = mydata[order(gross),]
newGross = newdata$gross
newUserVotes = newdata$users_votes
newTCL = newdata$total_cast_likes
newDFL = newdata$director_fb_likes
newCriticReviews = newdata$critic_reviews
#Layout allows us to see all 5 graphs on one screen (2 rows by 3 columns; 0 leaves the last cell empty)
layout(matrix(c(1:5, 0), nrow = 2, ncol = 3, byrow = TRUE))
plot(newGross, type="b", xlab = "Case Number", ylab = "Gross in $")
plot(newUserVotes, type="b", xlab = "Case Number", ylab = "User Votes")
plot(newTCL, type="b", xlab = "Case Number", ylab = "Total Cast Likes")
plot(newDFL, type="b", xlab = "Case Number", ylab = "Director Facebook Likes")
plot(newCriticReviews, type="b", xlab = "Case Number", ylab = "Critic Reviews")
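The reordering step relies on how order works: it returns row positions, not sorted values, and by default places missing values last. A minimal sketch on a made-up data frame (all values are illustrative):

```r
# Toy data frame standing in for mydata (values are made up)
toy_df <- data.frame(
  gross       = c(300, NA, 100, 200),
  users_votes = c(30, 5, 12, 25)
)

idx <- order(toy_df$gross)   # row positions from lowest to highest gross; NA goes last
sorted_df <- toy_df[idx, ]   # every column is reordered together

sorted_df$gross
sorted_df$users_votes
```

Because whole rows move together, each movie keeps its own user votes, likes, and reviews after sorting, which is what makes the side-by-side plots comparable.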
Given a gross value of 10,214,013, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.
We know that the z-score = (x - mean)/sd. So, input this into the R code where x=10214013, mean=48468408, and stdev = 68452990 which we found above.
Based on the z-value, how would you rate a gross value of $10,214,013: poor, average, good, or very good performance? Explain your logic.
gross_zscore = (10214013-48468408)/68452990
gross_zscore
[1] -0.5588418
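The z-score of -0.559 indicates poor performance. It lies between 0 (exactly average) and -1 (roughly the bottom 16% under a normal distribution) and is slightly closer to -1, so under a normal approximation this value sits at about the 29th percentile. The summary output agrees with a below-average rating: 10,214,013 falls between the first quartile (5,340,988) and the median (25,517,500) of gross.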
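The percentile estimate can be checked with pnorm, which gives the share of a normal distribution that lies below a given z-score. This is a minimal sketch that assumes gross is approximately normal; since the mean of gross is well above its median, that assumption is rough and the result should be read as an approximation. The variable names `x`, `z`, and `pct` are illustrative.

```r
x          <- 10214013   # gross value being rated
mean_gross <- 48468408   # rounded mean from the output above
sd_gross   <- 68452990   # rounded standard deviation from the output above

z   <- (x - mean_gross) / sd_gross
pct <- pnorm(z)          # share of a normal distribution below z, roughly 0.29

z
pct
```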