Data604 - Homework#3

Problem 1

In this first problem we will similuate the through a of 2 dice and take the sum of the throws. We will run this simulation with 3 distincts scenarios;
1. 50 throws of a pair of fair dice
2. 500 throws of a pair of fair dice
3. 50 throws of a pair of loaded dice
4. 10,000 throws of a pair of fair dice

We will performed the calculations in R.

f_throw_die <- function(prob, t){
  
  # assign face of dice based on input probability
  if (t<=prob[1]){
    face<-1
  } else if (t<=prob[2]){
    face<-2
  } else if (t<=prob[3]){
    face<-3
  } else if (t<=prob[4]){
    face<-4
  } else if (t<=prob[5]){
    face<-5
  } else if (t<=prob[6]){
    face<-6
  }
  return (face)
}

f_game <- function(prob1, prob2, n){
  # function to simulate a games of dice
  
  # initialize empty vector to store results of throws
  die1 <- numeric()
  die2 <- numeric()
  
  # Generate uniformely distributed numbers one for each throw 
  die1_t <- runif(n, 0, 1)
  die2_t <- runif(n, 0, 1)
  
  #execute as many throws as necessary of each die
  
  for (i in 1:n){
    die1_f <- f_throw_die(prob1, die1_t[i])
    die2_f <- f_throw_die(prob2, die2_t[i])
    
    die1 <- c(die1, die1_f)
    die2 <- c(die2, die2_f)
    
  }
  
  df_dice <- data.frame(cbind(die1,die2))
  
  # force Column name of dataframe
  colnames <- c('die1', 'die2')
  
  # create sum
  df_dice$sum <- df_dice$die1 + df_dice$die2
  
  return(df_dice)
}

Scenario 1: 50 throws fair dice

The first scenario is to play with a pair of fair dice and throw 50 times.

# Build a probability for dice
fair_dice <- c(1/6, 2/6, 3/6, 4/6, 5/6, 6/6)

# we will through the dice 50 times
throws <- 50

# we will store the results in a dataframe
df_game1 <- f_game(fair_dice, fair_dice, throws)

We will now plot the histogram of the results.

library(ggplot2)

g1 <- ggplot(data=df_game1, aes(df_game1$sum)) + geom_histogram()

g1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The probability distribution of the sum of throwing a pair of fair dice is well known and we would expect a normal distribution. We can see from histogram above it approximate a normal distribution. We would have better approximation with a higher number of throws.

Let us repeat with experiment with 500 throws.

Scenario 2: 500 throws fair dice

Let us now repeating the exercise of throwing a pair of fair dice, this time we will simulate 500 throws. Since we will again use fair dice, we will use the probability of 1/6 for the probability outcome of each face of the dice.

# we will through the dice 500 times
throws <- 500

# we will store the results in a dataframe
df_game2 <- f_game(fair_dice, fair_dice, throws)

We will now plot the histogram of the results.

g2 <- ggplot(data=df_game2, aes(df_game2$sum)) + geom_histogram()

g2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The resulting histogram shows a normal distribution centered around the value 7 which is as we would expect.

Scenario 3: 50 throws unfair dice

For this experiment, we will make both dice unfair by changing the probability of the outcome of for each face of the die. We will favor the lower numbers 1, 2, 3. We will do so by assigning the following probabilies.

Face	Probability
1	3/12
2	3/12
3	3/12
4	1/12
5	1/12
6	1/12

Note: The sum of the probabilities must equal to 1

# Build a probability for dice
loaded_dice <- c(3/12, 6/12, 9/12, 10/12, 11/12, 12/12)

# we will through the dice 50 times
throws <- 50

# we will store the results in a dataframe
df_game3 <- f_game(loaded_dice, loaded_dice, throws)

We will now plot the histogram of the results. Based on the assign probabilies, we will expect an asymetric distribution skewed to the right.

g3 <- ggplot(data=df_game3, aes(df_game3$sum)) + geom_histogram()

g3

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram result, we can observe that the most prevelant outcome has shifted to a lower value (< 7) as would have been expected.

Scenario 4: 10,000 throws fair dice

We will now repeat the experiment with fair dice and 10,000 throws. We will compare our results with the expected outcome and true probabilities of the outcomes and the true expected values of 7.

The true probabilities distribution of a sum of 2 fair dices is given in the tabble below:

Sum	2	3	4	5	6	7	8	9	10	11	12
Probabilities	1/36	2/36	3/36	4/36	5/36	6/36	5/36	4/36	3/36	2/36	1/36

# we will through the dice 50 times
throws <- 10000

# we will store the results in a dataframe
df_game4 <- f_game(fair_dice, fair_dice, throws)

we will now compute the probabilities matching our outcomes. We will do so may counting the outcome for each possible sum and dividing by the number of throws. We will then compare with the true probabilities. We will compare by calculating the error: experiment result - true probabilities.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df_prob <- df_game4 %>% arrange(sum) %>% group_by(sum) %>% summarise(count = n())

df_prob$calc_prob <- df_prob$count/throws

true_prob <- c(1/36, 2/36,  3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36)

df_prob <- cbind.data.frame(df_prob, true_prob)

df_prob$error <- df_prob$calc_prob - df_prob$true_prob

We will now plot the errors for each sum outcome.

g5 <- ggplot(df_prob, aes(x=sum, y=error)) + geom_point(shape=1) + geom_hline(yintercept = 0, color = 'blue')   

g5

We can see from the results that the errors are quite small.

We will now calculate the expected value for our results and compare to the true expected value of 7.

calc_EV = sum(df_prob$sum*df_prob$calc_prob)

true_EV = sum(df_prob$sum*df_prob$true_prob)

diff_EV = calc_EV - true_EV

The Exptected value for our results is 7.0107. The true Expected value is 7. The difference of result vs true is 0.0107.

Problem 5

We will performed the Monte Carlo Integration for the following function h(x) to be the probability density function of a normal variable with mean \(\mu\) and standard deviation \(\sigma\) > 0, which is given by the following:

\(h(x)\quad =\quad { \Phi }_{ \mu ,\sigma }(x)\quad =\quad \frac { 1 }{ \sigma \sqrt { 2\pi } } \cdot { e }^{ \frac { { -(x-\mu ) }^{ 2 } }{ 2{ \sigma }^{ 2 } } }\)

Let us look at the case where;
\(\mu =5.8\)
\(\sigma =2.3\)

The Monte Carlo Integration method provide a way to estimate the integral of a function h(x) to E(f(x)) provided that h(x)=f(x)p(x) where p(x) is a probability density function.

In the case above, we will estimate the integral of h(x) stated above over the interval [a,b] where a = 4.5 and b=6.7

Let p(x) a uniform distribution function such as

\(p(x)=\begin{cases} \frac { 1 }{ (b-a) } ,\quad when\quad a\le x\le b \\ 0\quad ,\quad otherwise \end{cases}\)

Hence we have:

\(\int _{ a }^{ b }{ h(x)dx\quad =\quad \int _{ a }^{ b }{ f(x)\cdot \frac { 1 }{ (b-a) } } dx } \quad =\quad E\left[ f(x) \right]\)

and

\(f(x)\quad =\quad (b-a)h(x)\)

Let N be the number of random variables X,

\(E\left[ f(x) \right] \quad \approx \quad \frac { 1 }{ N } \sum _{ i=1 }^{ N }{ { X }_{ i } }\)

The calculations will be done in R.

N <- 50  
a <- 4.5
b <- 6.7

mu <- 5.8
sigma <- 2.3

f_x <- function(x, a, b, mu, sigma){
  # functioln to calculate f(x)
  q <- 1/(sqrt(2*pi*(sigma^2)))
  
  expo <- -((x-mu)^2)/(2*(sigma^2))
  result <- (b-a)*q*exp(expo)
  
  return(result)
}

X <- runif(N, a, b)

Y <- sapply(X, f_x, a=a, b=b, mu=mu, sigma=sigma)

mu_est <- mean(Y)

#Actual value for this we will use cdf for normal distribution to find value area between a and b
actual_mu <- pnorm(b, mean=mu, sd=sigma)-pnorm(a,mean=mu, sd=sigma)

sigma_est <- sd(Y)

# 95% Confidence Interval

error_95 <- qnorm(0.975)*sigma_est/sqrt(N)

left_95 <- mu_est-error_95
right_95 <- mu_est+error_95

left_95

## [1] 0.3629541

right_95

## [1] 0.3712566

# 90% confidence Interval  

error_90 <- qnorm(0.95)*sigma_est/sqrt(N)

left_90 <- mu_est-error_90
right_90 <- mu_est+error_90

left_90

## [1] 0.3636215

right_90

## [1] 0.3705891

# 99% confidence Interval  

error_99 <- qnorm(0.995)*sigma_est/sqrt(N)

left_99 <- mu_est-error_99
right_99 <- mu_est+error_99

left_99

## [1] 0.3616497

right_99

## [1] 0.372561

actual_mu

## [1] 0.3662509

Run the estimation a few times and the actual value is not always in the 90% confidence interval.

Reference for this problem beyond text book:
https://www.youtube.com/watch?v=MKnjsqYVG4Y

http://www.math.wichita.edu/~cma/stat774/ch3.pdf

Problem 17

This problem was originally done in Excell part of discussion 3.