Simulation in R

In class we have talked about computational methods and introduced the idea of modeling and simulation. We discussed how many levels of uncertainty and complex interactions make systems challenging to predict and understand.

One approach to deal with uncertainty, such as in the expected cost example, is to run repeated simulations, tallying outcomes as we go. In our examples, the computer is making repeated random draws from our simulated distribution of sales. Based on each draw we program our model to calculate the total cost using the expressions we formalized in the preceding decision tree. After many draws, we could draw a histogram (risk profile) of the cost figures and calculate the average cost of going with each supplier.

We simulate, because we can then compare them and make our decision based on which one is likely to cost us less on average according to these simulations. It is important to note that these estimations are anything but certain, and are only going to approximate the real-world closely if the data we use to generate them mirrors the real world.

Assignment items: Part I: - Make a Monte Carlo simulation (code an R function and run it multiple times) - Determine which actor is better (t-test for your outputs).

Part II: - Modify your function (add another argument: Genre) - Simulate with Genre and Actor (determine which is the best actor/actress for a Comedy movie)

Submit by Sunday 5/17 11:59 pm EST - Canvas upload.

Good luck!

Movie production hollywood star allocation: A Monte Carlo simulation

Situation: You are a big hollywood producer. You receive two exceptional scripts and among the board of executives the decision is to make both into movies. One is a Comedy script, while the other is an Action movie. You have enough money to fund both productions, but there is a scarce resource: there are only a handfull of hollywood stars that could play the lead actor/actress.

Question: How many more viewers can we expect if we allocate different actors/actress?

Resources: Your analytics time has provided you with critical insights of your consumers. We have their preferences (rating) over the movie stars. Load the UserActorRating.csv to your Rstudio.

#First will create a blank data frame with space for 100k observations:
UserData <- read.csv("~/Documents/HU/ANLY 525-50-B/Week 9 - Computational Methods - Part I | Tuesday 5:12/UserActorRating.csv")

str(UserData)
## 'data.frame':    249 obs. of  5 variables:
##  $ User    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Actor1  : int  5 2 5 1 5 3 5 3 5 4 ...
##  $ Actress2: int  5 1 4 3 4 2 3 2 5 4 ...
##  $ Actor3  : int  4 4 2 3 2 1 4 3 1 2 ...
##  $ Genre   : Factor w/ 2 levels "A","C": 2 1 2 1 2 1 1 1 2 2 ...
summary(UserData[-1])
##      Actor1         Actress2         Actor3      Genre  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   A:151  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   C: 98  
##  Median :3.000   Median :3.000   Median :3.000          
##  Mean   :3.133   Mean   :3.016   Mean   :3.024          
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000          
##  Max.   :5.000   Max.   :5.000   Max.   :5.000

Part I - Modeling User utility from our movies

First, let us estimate if our consumers will go watch the possible movies or not. We have a simple model of User Behavior.

If the User likes the actor/actress with a rating of 4 or 5, then the User likes the movie. So, whenever the rating for a specific actor is 4 or higher, your function should return 1 (if they will watch the movie) or 0 (if they won’t).

Using the “ifelse(condition,consequence if true, consequence if false)” function, model this user behavior.

watchMovie <- function(actorRating) {  # This tells R what we will call our function and the parameters that it needs to estimate
               estimate = ifelse(actorRating %in% c(4,5),
                                 1,
                                 0
                                 )
               return(estimate)
               
               # Your function should return 1 if they will watch the movie, 0 if they will not.
}

#Try your function with this list of 3 values.

watchMovie(list(1,3,5))
## [1] 0 0 1
#The output should be:  0  0  1

#As only the third element would be watched by our user model.

Ok. Now, run this function in 100 thousand simulations!!

#First will create a blank data frame with space for 100k observations:
simulations <- data.frame(matrix( , nrow = 100000, ncol = 0))

#Then we simulate 100k different estimations of watching.

#We do this for all three actors. We randomly sample our user model many times.

set.seed(525)
simulations$watchedA1 <- watchMovie(sample(UserData$Actor1, 100000, replace = TRUE))
simulations$watchedA2 <- watchMovie(sample(UserData$Actress2, 100000, replace = TRUE))
simulations$watchedA3 <- watchMovie(sample(UserData$Actor3, 100000, replace = TRUE))

#The mean of our simulations is below. Do the same for all actors.
#mean(simulations$watchedA1)
sapply(simulations, mean)
## watchedA1 watchedA2 watchedA3 
##   0.45493   0.39191   0.42861

Alright, so we have our simulations results and now we want to make a decision based on this estimations. Now that we have our simulations, we can compare them a few different ways. We could compare average movie-goers among the simulations.

To do this, we can do a t-test in pairs.

t.test(simulations$watchedA1, simulations$watchedA2)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA1 and simulations$watchedA2
## t = 28.578, df = 199919, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05869786 0.06734214
## sample estimates:
## mean of x mean of y 
##   0.45493   0.39191
t.test(simulations$watchedA1, simulations$watchedA3)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA1 and simulations$watchedA3
## t = 11.855, df = 199990, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02196869 0.03067131
## sample estimates:
## mean of x mean of y 
##   0.45493   0.42861
t.test(simulations$watchedA2, simulations$watchedA3)
## 
##  Welch Two Sample t-test
## 
## data:  simulations$watchedA2 and simulations$watchedA3
## t = -16.695, df = 199961, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04100849 -0.03239151
## sample estimates:
## mean of x mean of y 
##   0.39191   0.42861

Determine which is the best actor given our model.
Welch’s t-test has a null hypothesis (\(H_0\)) that two populations have equal means. The three p-values are not statistically significant at an alpha level of 0.05, hence reject the \(H_0\). In other words, two actors and one actress have different average ratings. Actor1 has the best average rating in simulation.

Part II - Multiple Levels of Uncertainty

Now, imagine that is not only preference for hollywood stars. Now we consider preference for the genre: Comedy or Action.

The problem is we don’t know much about our audience preference for these genres. We only know which one they like (not how much)

We can use a Triangle distribution to offer an estimation. We use a continous approximation by using a triangle distribution with minimum = 1, maximum = 5. Medians will depend on which genre they like.

If the user likes the genre type, it has a median = 4 (a higher probability of watching it). If the user does not like the genre, it has median = 2 (lower probability of watching).

First, modify your previous function. Provide another ifelse() loop that includes Genre (‘C’ or ‘A’). We will use the triangle distribution here.

if(!require(triangle)) {install.packages("triangle"); library(triangle)}
## Loading required package: triangle
watchMovie <- function(actorRating,genre) {  # This tells R what we will call our function and the parameters that it needs to estimate
               estimate = ifelse(actorRating>3,
                                 ifelse(genre == 'C',round(rtriangle(1, 1, 5, 4)),round(rtriangle(1, 1, 5, 2))),
                                 0
                                 )
               return(estimate)
               
               # Have in consideration that this function only returns the expected rating of watching movies of genre C ('Comedy')
}

#Try your function with this list of 3 values for ActorRatings and Genre

watchMovie(list(1,3,5),list('C','A','A'))
## [1] 0 0 2
#The output should be something like:  0  0  3

#Expect values between 1 and 5 in the third element.

Alright, now simulate 100 thousands cases.

#First will create a blank data frame with space for 100k observations:
simulations <- data.frame(matrix( , nrow = 100000, ncol = 0))

#Then we simulate 100k different estimations of watching.

#We do this for all three actors. We randomly sample our user model many times.

set.seed(525)
simulations$watchedA1 <- watchMovie(sample(UserData$Actor1, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))
simulations$watchedA2 <- watchMovie(sample(UserData$Actress2, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))
simulations$watchedA3 <- watchMovie(sample(UserData$Actor3, 100000, replace = TRUE),sample(UserData$Genre, 100000, replace = TRUE))

#With this, we can get some idea of central tendency and variability:
#mean(simulations$watchedA1)
sapply(simulations, mean)
## watchedA1 watchedA2 watchedA3 
##   1.08751   1.33534   1.18840
#sd(simulations$watchedA1)
sapply(simulations, sd)
## watchedA1 watchedA2 watchedA3 
##  1.235033  1.687043  1.517356

So, considering these simulations (Genre and Actors), which Actor would be better for Comedy?
Actress2 has the best average rating for the comedy gengre in simulation.

Additional

We can algo get a visual representation of the probabilities that we are interested on:

hist(simulations$watchedA2,probability=TRUE)
lines(density(simulations$watchedA2),col="red")

plot(ecdf(simulations$watchedA2))

We could then calculate the likelihood of the total time being some number or less using the CDF of our simulation. That is we could determine the 75%, 85%, 95% probability that the total time would be X or less.

cdf = ecdf(simulations$watchedA2)
quantile(cdf,.75) 
## 75% 
##   3
      #2606 ~ 75% of the time we would expect the total time to be 2605 hours or less
quantile(cdf,.85) 
## 85% 
##   4
      #2822 ~ 85% of the time we would expect the total time to be 2821 hours or less
quantile(cdf,.95) 
## 95% 
##   4
      #3140 ~ 95% of the time we would expect the total time to be 3141 hours or less