Begin by reading in the data from the ‘rottentomatoes.csv’ file and viewing the first rows to confirm it was read in correctly.
mydata = read.csv(file="data/rottentomatoes.csv")
head(mydata)
## title
## 1 Avatar
## 2 Pirates of the Caribbean: At World's End
## 3 Spectre
## 4 The Dark Knight Rises
## 5 Star Wars: Episode VII - The Force Awakens
## 6 John Carter
## genres director actor1
## 1 Action|Adventure|Fantasy|Sci-Fi James Cameron CCH Pounder
## 2 Action|Adventure|Fantasy Gore Verbinski Johnny Depp
## 3 Action|Adventure|Thriller Sam Mendes Christoph Waltz
## 4 Action|Thriller Christopher Nolan Tom Hardy
## 5 Documentary Doug Walker Doug Walker
## 6 Action|Adventure|Sci-Fi Andrew Stanton Daryl Sabara
## actor2 actor3 length budget director_fb_likes
## 1 Joel David Moore Wes Studi 178 237000000 0
## 2 Orlando Bloom Jack Davenport 169 300000000 563
## 3 Rory Kinnear Stephanie Sigman 148 245000000 0
## 4 Christian Bale Joseph Gordon-Levitt 164 250000000 22000
## 5 Rob Walker NA NA 131
## 6 Samantha Morton Polly Walker 132 263700000 475
## actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes
## 1 1000 936 855 4834
## 2 40000 5000 1000 48350
## 3 11000 393 161 11700
## 4 27000 23000 23000 106759
## 5 131 12 NA 143
## 6 640 632 530 1873
## fb_likes critic_reviews users_reviews users_votes score aspect_ratio
## 1 33000 723 3054 886204 7.9 1.78
## 2 0 302 1238 471220 7.1 2.35
## 3 85000 602 994 275868 6.8 2.35
## 4 164000 813 2701 1144337 8.5 2.35
## 5 0 NA NA 8 7.1 NA
## 6 24000 462 738 212204 6.6 2.35
## gross year
## 1 760505847 2009
## 2 309404152 2007
## 3 200074175 2015
## 4 448130642 2012
## 5 NA NA
## 6 73058679 2012
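Beyond head(), a few quick checks can help confirm that a file was read in as expected. The following is a minimal sketch on a small stand-in data frame named demo (a hypothetical name, with three rows copied from the output above); with the lab's data the same calls would be applied to mydata.

```r
# Hypothetical stand-in for the movie data; values copied from the head() output.
demo = data.frame(
  title  = c("Avatar", "Spectre", "John Carter"),
  length = c(178, 148, 132),
  gross  = c(760505847, 200074175, NA)   # NA added here only to illustrate the check
)

dim(demo)                          # number of rows and columns
str(demo)                          # type of each column
na_counts = colSums(is.na(demo))   # missing values per column
na_counts
```

Columns with many missing values, such as gross and budget in this data set, are worth noting before computing statistics.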
Now calculate the range, min, max, mean, standard deviation, and variance for each variable. Below, these items are computed for the variable ‘gross’, and then in the same way for users_votes, total_cast_likes, director_fb_likes, and critic_reviews. Missing values (NA) are dropped first so that each function returns a number.
gross = mydata$gross
gross = gross[!is.na(gross)]
max_gross = max(gross)
max_gross
## [1] 760505847
min_gross = min(gross)
min_gross
## [1] 162
max_gross-min_gross
## [1] 760505685
mean_gross = mean(gross)
mean_gross
## [1] 48468408
sd_gross = sd(gross)
sd_gross
## [1] 68452990
var_gross = var(gross)
var_gross
## [1] 4.685812e+15
uservotes = mydata$users_votes
uservotes = uservotes[!is.na(uservotes)]
max_uservotes = max(uservotes)
max_uservotes
## [1] 1689764
min_uservotes = min(uservotes)
min_uservotes
## [1] 5
max_uservotes-min_uservotes
## [1] 1689759
mean_uservotes = mean(uservotes)
mean_uservotes
## [1] 83668.16
sd_uservotes = sd(uservotes)
sd_uservotes
## [1] 138485.3
var_uservotes = var(uservotes)
var_uservotes
## [1] 19178166353
totalcastlikes = mydata$total_cast_likes
totalcastlikes = totalcastlikes[!is.na(totalcastlikes)]
max_totalcastlikes = max(totalcastlikes)
max_totalcastlikes
## [1] 656730
min_totalcastlikes = min(totalcastlikes)
min_totalcastlikes
## [1] 0
max_totalcastlikes-min_totalcastlikes
## [1] 656730
mean_totalcastlikes = mean(totalcastlikes)
mean_totalcastlikes
## [1] 9699.064
sd_totalcastlikes = sd(totalcastlikes)
sd_totalcastlikes
## [1] 18163.8
var_totalcastlikes = var(totalcastlikes)
var_totalcastlikes
## [1] 329923599
directorFBlikes = mydata$director_fb_likes
directorFBlikes = directorFBlikes[!is.na(directorFBlikes)]
max_directorFBlikes = max(directorFBlikes)
max_directorFBlikes
## [1] 23000
min_directorFBlikes = min(directorFBlikes)
min_directorFBlikes
## [1] 0
max_directorFBlikes-min_directorFBlikes
## [1] 23000
mean_directorFBlikes = mean(directorFBlikes)
mean_directorFBlikes
## [1] 686.5092
sd_directorFBlikes = sd(directorFBlikes)
sd_directorFBlikes
## [1] 2813.329
var_directorFBlikes = var(directorFBlikes)
var_directorFBlikes
## [1] 7914818
criticreviews = mydata$critic_reviews
criticreviews = criticreviews[!is.na(criticreviews)]
max_criticreviews = max(criticreviews)
max_criticreviews
## [1] 813
min_criticreviews = min(criticreviews)
min_criticreviews
## [1] 1
max_criticreviews-min_criticreviews
## [1] 812
mean_criticreviews = mean(criticreviews)
mean_criticreviews
## [1] 140.1943
sd_criticreviews = sd(criticreviews)
sd_criticreviews
## [1] 121.6017
var_criticreviews = var(criticreviews)
var_criticreviews
## [1] 14786.97
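The same six statistics are repeated for every variable above, so the steps could be wrapped in a function. This is only a sketch: describe_var and example_values are hypothetical names that are not part of the original lab, and the vector is made up to show the call.

```r
# Hypothetical helper (not part of the original lab): one call per variable.
describe_var = function(x) {
  x = x[!is.na(x)]              # drop missing values, as in the code above
  c(min   = min(x),
    max   = max(x),
    range = max(x) - min(x),
    mean  = mean(x),
    sd    = sd(x),
    var   = var(x))
}

# Small made-up vector, used only to show the call.
example_values = c(2, 4, 4, 4, 5, 5, 7, 9, NA)
stats = describe_var(example_values)
stats
```

With the lab's data, the call would look like describe_var(mydata$gross).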
A quicker way to get most of these statistics for every variable at once is the summary function. Below is an example.
summary(mydata)
## title genres
## Ben-Hur : 3 Drama : 236
## Halloween : 3 Comedy : 209
## Home : 3 Comedy|Drama : 191
## King Kong : 3 Comedy|Drama|Romance: 187
## Pan : 3 Comedy|Romance : 158
## The Fast and the Furious : 3 Drama|Romance : 152
## (Other) :5025 (Other) :3910
## director actor1 actor2
## : 104 Robert De Niro : 49 Morgan Freeman : 20
## Steven Spielberg: 26 Johnny Depp : 41 Charlize Theron: 15
## Woody Allen : 22 Nicolas Cage : 33 Brad Pitt : 14
## Clint Eastwood : 20 J.K. Simmons : 31 : 13
## Martin Scorsese : 20 Bruce Willis : 30 James Franco : 11
## Ridley Scott : 17 Denzel Washington: 30 Meryl Streep : 11
## (Other) :4834 (Other) :4829 (Other) :4959
## actor3 length budget
## : 23 Min. : 7.0 Min. :2.180e+02
## Ben Mendelsohn: 8 1st Qu.: 93.0 1st Qu.:6.000e+06
## John Heard : 8 Median :103.0 Median :2.000e+07
## Steve Coogan : 8 Mean :107.2 Mean :3.975e+07
## Anne Hathaway : 7 3rd Qu.:118.0 3rd Qu.:4.500e+07
## Jon Gries : 7 Max. :511.0 Max. :1.222e+10
## (Other) :4982 NA's :15 NA's :492
## director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 7.0 1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0
## Median : 49.0 Median : 988 Median : 595 Median : 371.5
## Mean : 686.5 Mean : 6560 Mean : 1652 Mean : 645.0
## 3rd Qu.: 194.5 3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0
## Max. :23000.0 Max. :640000 Max. :137000 Max. :23000.0
## NA's :104 NA's :7 NA's :13 NA's :23
## total_cast_likes fb_likes critic_reviews users_reviews
## Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0
## Median : 3090 Median : 166 Median :110.0 Median : 156.0
## Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8
## 3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0
## Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0
## NA's :50 NA's :21
## users_votes score aspect_ratio gross
## Min. : 5 Min. :1.600 Min. : 1.18 Min. : 162
## 1st Qu.: 8594 1st Qu.:5.800 1st Qu.: 1.85 1st Qu.: 5340988
## Median : 34359 Median :6.600 Median : 2.35 Median : 25517500
## Mean : 83668 Mean :6.442 Mean : 2.22 Mean : 48468408
## 3rd Qu.: 96309 3rd Qu.:7.200 3rd Qu.: 2.35 3rd Qu.: 62309438
## Max. :1689764 Max. :9.500 Max. :16.00 Max. :760505847
## NA's :329 NA's :884
## year
## Min. :1916
## 1st Qu.:1999
## Median :2005
## Mean :2002
## 3rd Qu.:2011
## Max. :2016
## NA's :108
The summary function does not report the standard deviation or variance, but it is a quick and easy way to find most of the basic statistics.
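One possible way to fill that gap is to apply sd and var to every numeric column at once. The sketch below uses a made-up data frame called demo (a hypothetical stand-in); na.rm = TRUE tells each function to skip missing values.

```r
# Made-up stand-in data frame; with the lab's data, replace demo with mydata.
demo = data.frame(
  title = c("A", "B", "C", "D"),
  score = c(7.9, 7.1, 6.8, 8.5),
  gross = c(100, 300, NA, 200)
)

numeric_cols = demo[sapply(demo, is.numeric)]      # keep only the numeric columns
sd_all  = sapply(numeric_cols, sd,  na.rm = TRUE)  # standard deviation per column
var_all = sapply(numeric_cols, var, na.rm = TRUE)  # variance per column
sd_all
var_all
```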
Now we will produce a basic plot of the ‘gross’ variable. Here we use the plot function and pass it the variable we want to plot.
plot(gross)
When looking at this graph we cannot see a clear pattern, because the points appear in the order they occur in the file. A better way to visualize the data is to re-order it by increasing gross.
plot(sort(gross))  # sort() returns the values in increasing order; order() would return row positions instead
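When re-ordering data for a plot, it helps to keep the difference between order() and sort() in mind. A small sketch with made-up values:

```r
x = c(30, 10, 20)   # made-up values

order(x)            # positions of the values from smallest to largest: 2 3 1
sort(x)             # the values themselves in increasing order: 10 20 30
x[order(x)]         # indexing with order() gives the same result as sort(x)
```

Plotting the result of order() shows row positions, while plotting sort(x) or x[order(x)] shows the re-ordered values.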
marketing = read.csv("data/marketing.csv")
sales = marketing$sales
radio = marketing$radio
tv = marketing$tv
paper = marketing$paper
plot(sales, type="b", xlab = "Case Number", ylab = "Sales in $1,000")
layout(matrix(1:4,2,2))
plot(sales, type="b", xlab = "Case Number", ylab = "Sales in $1,000")
plot(radio, type="b", xlab = "Case Number", ylab = "Radio")
plot(paper, type="b", xlab = "Case Number", ylab = "Paper")
plot(tv, type="b", xlab = "Case Number", ylab = "TV")
There are further ways to customize plots, such as changing the colors of the lines, adding a heading, or even making them interactive.
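As a small illustration of such customization, the sketch below sets a line color and a heading through the col and main arguments of plot. The sales values are copied from the marketing table in this document; the title text is a hypothetical choice.

```r
# Sales values copied from the marketing table in this document.
sales = c(11125, 16121, 16440, 16876, 13965, 14999, 20167, 20450, 15789, 15991,
          15234, 17522, 17933, 18390, 18723, 19328, 19399, 19641, 12369, 13882)

plot(sales, type = "b",
     col  = "blue",            # color of the line and points
     main = "Sales by Case",   # heading (hypothetical title text)
     xlab = "Case Number", ylab = "Sales in $1,000")
```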
Now, let's plot the sales graph alongside radio, paper, and tv, which you will code. Make sure to run the code in the same chunk so that all the plots share one layout.
likes01 = mydata$actor1_fb_likes
likes02 = mydata$actor2_fb_likes
likes03 = mydata$actor3_fb_likes
likes04 = mydata$director_fb_likes
layout(matrix(1:4,2,2))
plot(likes01, type="b", xlab = "Movie Index", ylab = "Actor 1 Facebook Likes")
plot(likes02, type="b", xlab = "Movie Index", ylab = "Actor 2 Facebook Likes")
plot(likes03, type="b", xlab = "Movie Index", ylab = "Actor 3 Facebook Likes")
plot(likes04, type="b", xlab = "Movie Index", ylab = "Director Facebook Likes")
The 20 cases in case_number are in no particular order and do not follow a chronological sequence. They are simply 20 independent case studies. Since each case is independent, we can reorder them. To reveal a potential trend, consider reordering the sales column from low to high and see how the other four variables behave.
year = mydata$year
newdata = mydata[order(year, decreasing = TRUE),]  # sort rows from newest to oldest; rows with NA year go last
new_gross = newdata$gross
new_likes1 = newdata$actor1_fb_likes
new_likes2 = newdata$actor2_fb_likes
new_likes3 = newdata$actor3_fb_likes
new_likes4 = newdata$director_fb_likes
plot(new_gross)
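The reordering idea described for the marketing data can be sketched as follows. The data frame marketing_demo is a hypothetical name holding the first six rows of the marketing table shown in this document; reordering whole rows keeps each case's radio and tv values attached to its sales value.

```r
# First six rows of the marketing table shown in this document.
marketing_demo = data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  tv    = c(250, 260, 270, 270, 255, 255)
)

# Reorder whole rows by sales, low to high, so every column moves together.
by_sales = marketing_demo[order(marketing_demo$sales), ]

layout(matrix(1:3, 1, 3))
plot(by_sales$sales, type = "b", xlab = "Rank by Sales", ylab = "Sales")
plot(by_sales$radio, type = "b", xlab = "Rank by Sales", ylab = "Radio")
plot(by_sales$tv,    type = "b", xlab = "Rank by Sales", ylab = "TV")
```

If the other panels rise along with sales after reordering, that suggests a relationship worth examining further.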
Given a sales value of 25000, calculate the corresponding z-value or z-score using the mean and standard deviation calculations conducted in task 1.
We know that z-score = (x - mean)/sd. So, enter this into the R code with x = 25000, mean = 16717.2, and sd = 2617.0521, the mean and standard deviation of sales.
Based on the z-value, how would you rate a $25000 sales value: poor, average, good, or very good performance? Explain your logic. THE Z-SCORE IS 3.165. THE MEAN OF SALES IS 16717.2. A 25000 SALES VALUE IS A VERY GOOD PERFORMANCE. THE Z-SCORE SHOWS THAT 25000 IS MORE THAN THREE STANDARD DEVIATIONS ABOVE THE MEAN, WHICH IS VERY RARE FOR NORMALLY DISTRIBUTED DATA.
marketing
## case_number sales radio paper tv pos
## 1 1 11125 65 89 250 1.3
## 2 2 16121 73 55 260 1.6
## 3 3 16440 74 58 270 1.7
## 4 4 16876 75 82 270 1.3
## 5 5 13965 69 75 255 1.5
## 6 6 14999 70 71 255 2.1
## 7 7 20167 87 59 280 1.2
## 8 8 20450 89 65 280 3.0
## 9 9 15789 72 62 260 1.6
## 10 10 15991 73 56 260 1.6
## 11 11 15234 70 66 255 1.5
## 12 12 17522 78 50 270 0.0
## 13 13 17933 79 47 275 0.2
## 14 14 18390 81 78 275 0.9
## 15 15 18723 81 41 275 1.0
## 16 16 19328 84 63 280 2.6
## 17 17 19399 84 77 280 1.2
## 18 18 19641 85 35 280 2.5
## 19 19 12369 65 37 250 2.5
## 20 20 13882 68 80 252 1.4
mean_sales = mean(sales)
sd_sales = sd(sales)
mean_sales
## [1] 16717.2
sd_sales
## [1] 2617.052
zscore = (25000 - mean_sales)/sd_sales
zscore
## [1] 3.164935
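To put the z-score in context, one option is to ask how likely a value this far above the mean would be. The sketch below assumes sales are roughly normally distributed, which the lab does not verify; upper_tail is a hypothetical name, and the sales values are copied from the marketing table above.

```r
# Sales values copied from the marketing table above.
sales = c(11125, 16121, 16440, 16876, 13965, 14999, 20167, 20450, 15789, 15991,
          15234, 17522, 17933, 18390, 18723, 19328, 19399, 19641, 12369, 13882)

zscore = (25000 - mean(sales)) / sd(sales)

# Assumption: sales are roughly normally distributed.
# Probability of a value at least this many standard deviations above the mean:
upper_tail = pnorm(zscore, lower.tail = FALSE)
upper_tail
```

A very small probability here supports rating a 25000 sales value as very good performance.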