Simulation in R

In class we have talked about computational methods and introduced the idea of modeling and simulation. We discussed how many levels of uncertainty and complex interactions make systems challenging to predict and understand.

One approach to deal with uncertainty, such as in the expected cost example, is to run repeated simulations, tallying outcomes as we go. In our examples, the computer is making repeated random draws from our simulated distribution of sales. Based on each draw we program our model to calculate the total cost using the expressions we formalized in the preceding decision tree. After many draws, we could draw a histogram (risk profile) of the cost figures and calculate the average cost of going with each supplier.

We simulate, because we can then compare them and make our decision based on which one is likely to cost us less on average according to these simulations. It is important to note that these estimations are anything but certain, and are only going to approximate the real-world closely if the data we use to generate them mirrors the real world.

Assignment items: Part I: - Make a Monte Carlo simulation (code an R function and run it multiple times) - Determine which actor is better (t-test for your outputs).

Part II: - Modify your function (add another argument: Genre) - Simulate with Genre and Actor (determine which is the best actor/actress for a Comedy movie)

Submit by Sunday 5/17 11:59 pm EST - Canvas upload.

Good luck!

Movie production hollywood star allocation: A Monte Carlo simulation

Situation: You are a big hollywood producer. You receive two exceptional scripts and among the board of executives the decision is to make both into movies. One is a Comedy script, while the other is an Action movie. You have enough money to fund both productions, but there is a scarce resource: there are only a handfull of hollywood stars that could play the lead actor/actress.

Question: How many more viewers can we expect if we allocate different actors/actress?

Resources: Your analytics time has provided you with critical insights of your consumers. We have their preferences (rating) over the movie stars. Load the UserActorRating.csv to your Rstudio.

#First will create a blank data frame with space for 100k observations:
UserData <- read.csv("C:/Users/Manisha/Downloads/UserActorRating.csv")

str(UserData)
## 'data.frame':    249 obs. of  5 variables:
##  $ User    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Actor1  : int  5 2 5 1 5 3 5 3 5 4 ...
##  $ Actress2: int  5 1 4 3 4 2 3 2 5 4 ...
##  $ Actor3  : int  4 4 2 3 2 1 4 3 1 2 ...
##  $ Genre   : chr  "C" "A" "C" "A" ...

Part I - Modeling User utility from our movies

First, let us estimate if our consumers will go watch the possible movies or not. We have a simple model of User Behavior.

If the User likes the actor/actress with a rating of 4 or 5, then the User likes the movie. So, whenever the rating for a specific actor is 4 or higher, your function should return 1 (if they will watch the movie) or 0 (if they won’t).

Using the “ifelse(condition,consequence if true, consequence if false)” function, model this user behavior.

watchMovie <- function(actorRating) {  # This tells R what we will call our function and the parameters that it needs to estimate
               estimate = ifelse(actorRating > 4,
                                 1,
                                 0
                                 )
               return(estimate)
               
               # Your function should return 1 if they will watch the movie, 0 if they will not.
}

#Try your function with this list of 3 values.

watchMovie(list(1,3,5))
## [1] 0 0 1
#The output should be:  0  0  1

#As only the third element would be watched by our user model.

Ok. Now, run this function in 100 thousand simulations!!

#First will create a blank data frame with space for 100k observations:
simulations <- data.frame(matrix( , nrow = 100000, ncol = 0))

#Then we simulate 100k different estimations of watching.

#We do this for all three actors. We randomly sample our user model many times.

simulations$watchedA1 <- watchMovie(sample(UserData$Actor1, 100000, replace = TRUE))

simulations$watchedA2 <- watchMovie(sample(UserData$Actress2, 100000, replace = TRUE))
simulations$watchedA3 <- watchMovie(sample(UserData$Actor3, 100000, replace = TRUE))

#The mean of our simulations is below. Do the same for all actors.
mean(simulations$watchedA1)
## [1] 0.25976
mean(simulations$watchedA2)
## [1] 0.19185
mean(simulations$watchedA3)
## [1] 0.20192
sapply(simulations,mean)
## watchedA1 watchedA2 watchedA3 
##   0.25976   0.19185   0.20192

Alright, so we have our simulations results and now we want to make a decision based on this estimations. Now that we have our simulations, we can compare them a few different ways. We could compare average movie-goers among the simulations.

To do this, we can do a t-test in pairs.

t.test(simulations$watchedA1, simulations$watchedA2)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA1 and simulations$watchedA2
## t = 36.439, df = 197725, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06425722 0.07156278
## sample estimates:
## mean of x mean of y 
##   0.25976   0.19185
t.test(simulations$watchedA2, simulations$watchedA3)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA2 and simulations$watchedA3
## t = -5.6631, df = 199923, p-value = 1.489e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.013555204 -0.006584796
## sample estimates:
## mean of x mean of y 
##   0.19185   0.20192
t.test(simulations$watchedA1, simulations$watchedA3)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA1 and simulations$watchedA3
## t = 30.766, df = 198458, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05415526 0.06152474
## sample estimates:
## mean of x mean of y 
##   0.25976   0.20192

Determine which is the best actor given our model.

There is a statistically significant difference between the three means. Actor 1 appears to have the best average rating as per the results of the T-tests.

Part II - Multiple Levels of Uncertainty

Now, imagine that is not only preference for hollywood stars. Now we consider preference for the genre: Comedy or Action.

The problem is we don’t know much about our audience preference for these genres. We only know which one they like (not how much)

We can use a Triangle distribution to offer an estimation. We use a continous approximation by using a triangle distribution with minimum = 1, maximum = 5. Medians will depend on which genre they like.

If the user likes the genre type, it has a median = 4 (a higher probability of watching it). If the user does not like the gente, it has median = 2 (lower probability of watching).

First, modify your previous function. Provide another ifelse() loop that includes Genre (‘C’ or ‘A’). We will use the triangle distribution here.

library(triangle)
## Warning: package 'triangle' was built under R version 4.2.3
watchMovie <- function(actorRating,genre) {  # This tells R what we will call our function and the parameters that it needs to estimate
               estimate = ifelse(actorRating>3,
                                 ifelse(genre == 'C',round(rtriangle(1, 1, 5, 4)),round(rtriangle(1, 1, 5, 2))),
                                 0
                                 )
               return(estimate)
               
               # Have in consideration that this function only returns the expected rating of watching movies of genre C ('Comedy')
}

#Try your function with this list of 3 values for ActorRatings and Genre

watchMovie(list(1,3,5),list('C','A','A'))
## [1] 0 0 2
#The output should be something like:  0  0  3

#Expect values between 1 and 5 in the third element.

Alright, now simulate 100 thousands cases.

#First will create a blank data frame with space for 100k observations:
simulations <- data.frame(matrix( , nrow = 100000, ncol = 0))

#Then we simulate 100k different estimations of watching.

#We do this for all three actors. We randomly sample our user model many times.

simulations$watchedA1 <- watchMovie(sample(UserData$Actor1, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))
simulations$watchedA2 <- watchMovie(sample(UserData$Actress2, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))
simulations$watchedA3 <- watchMovie(sample(UserData$Actor3, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))

#With this, we can get some idea of central tendency and variability:
print('Actor 1 results:')
## [1] "Actor 1 results:"
mean(simulations$watchedA1)
## [1] 0.91602
sd(simulations$watchedA1)
## [1] 0.9964724
print('Actress 2 results:')
## [1] "Actress 2 results:"
mean(simulations$watchedA2)
## [1] 1.41166
sd(simulations$watchedA2)
## [1] 1.786502
print('Actor 3 results:')
## [1] "Actor 3 results:"
mean(simulations$watchedA3)
## [1] 1.62766
sd(simulations$watchedA3)
## [1] 1.982232
sapply(simulations,mean)
## watchedA1 watchedA2 watchedA3 
##   0.91602   1.41166   1.62766
sapply(simulations,sd)
## watchedA1 watchedA2 watchedA3 
## 0.9964724 1.7865016 1.9822316

So, considering these simulations (Genre and Actors), which Actor would be better for Comedy? ## Considering the simulations, Actor 2 would be better for comedy.

Additional

We can algo get a visual representation of the probabilities that we are interested on:

hist(simulations$watchedA2,probability=TRUE)
lines(density(simulations$watchedA2),col="red")

plot(ecdf(simulations$watchedA2))

We could then calculate the likelihood of the total time being some number or less using the CDF of our simulation. That is we could determine the 75%, 85%, 95% probability that the total time would be X or less.

cdf = ecdf(simulations$watchedA1)
quantile(cdf,.75) 
## 75% 
##   2
      #2606 ~ 75% of the time we would expect the total time to be 2605 hours or less
quantile(cdf,.85) 
## 85% 
##   2
      #2822 ~ 85% of the time we would expect the total time to be 2821 hours or less
quantile(cdf,.95) 
## 95% 
##   2
      #3140 ~ 95% of the time we would expect the total time to be 3141 hours or less