Random Variables

Jonatan Sanchez & Erik Varga

dniGroup <- c(71810040,47278705)
sum(dniGroup)%%10 + 1

## [1] 6

Attention! you get this seed 139

#randvarOpAplus6
library(data.table)

Binomial

A Binomial distribution models the number of successes in a fixed number n of independent Bernoulli trials, each with the same probability of success p.

\[ X = Bin(n,p) \\ n = \textrm{number of trials} \\ p = \textrm{probability of success} \\ P(x) = \binom{n}{x} p^x (1-p)^{n-x} \\ E[X] = np \\ V[X] = np(1-p) \\ Desv[X] = \sqrt{np(1-p)} \\ \]
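The formulas above can be checked numerically. The following is a minimal sketch with illustrative values n = 8 and p = 0.525 (the names `n`, `p`, `x`, `pmfManual`, `ex` and `vx` are chosen here for illustration only). It compares the closed-form pmf with R's `dbinom` and recovers the mean and variance directly from the definition of expectation, which can serve as an additional method next to the closed forms and simulation.

```r
# Illustrative parameters (assumed for this sketch only)
n <- 8
p <- 0.525
x <- 0:n

# pmf from the closed-form expression
pmfManual <- choose(n, x) * p^x * (1 - p)^(n - x)

# Agreement with the built-in binomial pmf
all.equal(pmfManual, dbinom(x, n, p))

# Mean and variance from the definition of expectation
ex <- sum(x * pmfManual)
vx <- sum((x - ex)^2 * pmfManual)
c(mean = ex, variance = vx, sd = sqrt(vx))
```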

A company called Birrus Mac has deployed a malicious worm on the net. The effects of that malware on the file system are unknown. Suppose any possible PC can be infected. The probability of an infection in any local PC is 0.525 and the probability that it does not corrupt your files is 0.48. What do you think will happen if your organization has 8 local PCs and the average number of files on them is 9? Treat each situation as a separate problem.

Could you please solve the following questions:

Variables

nFile <- 9 #number of files
nPc <- 8 #number of pc's
pC <- 1-0.48 #probability of file corruption
pInf <- 0.525 #probability of pc infection
nExp <- 10000 #number of experiments

1.Draw the pmf of the number of infected PC’s

plot(x= 0:8, y = dbinom(0:8,nPc,pInf), type="l", xlab = "x", ylab = "P(x)", main = "PMF")

2.What is the probability that we have between 2 and 6 PC’s infected ?

sum(dbinom(2:6,nPc,pInf))

## [1] 0.9269497

3.What is the probability that more than 5 files are corrupted ?

sum(dbinom(6:9,nFile,pC))

## [1] 0.2948105

4.Show the cdf of the number of infected PC’s

cdf <- pbinom(0:8,nPc,pInf)
plot(x= 0:8, y = cdf, type="l", xlab = "x", ylab = "P(x)", main = "CDF")

5.Show the cdf of the number of corrupted files

cdf <- pbinom(0:9,nFile,pC)
plot(x= 0:9, y = cdf, type="l", xlab = "x", ylab = "P(x)", main = "CDF")

6.Compute the Expected value, the Variance and the Deviation of the number of corrupted files. Use at least two different methods

set.seed(139)
Ex = nFile*pC
Ex

## [1] 4.68

Ex2 = mean(rbinom(nExp, nFile, pC))
Ex2

## [1] 4.688

Var = Ex*(1-pC)
Var

## [1] 2.2464

Var2 = var(rbinom(nExp, nFile, pC))
Var2

## [1] 2.268908

Dev = sqrt(Var)
Dev

## [1] 1.4988

Dev2 = sd(rbinom(nExp, nFile, pC))
Dev2

## [1] 1.502587

7.What is the probability that all the files are corrupted?

dbinom(nFile,nFile,pC)

## [1] 0.002779906

8.What is the probability that we have between 7 and 8 files not corrupted ?

sum(dbinom(7:8,nFile,1-pC))

## [1] 0.07033548

Geometric Distribution

A geometric distribution is defined as the number of trials until the first success is observed. In other words, it is the number of Bernoulli experiments needed to obtain the first successful outcome.

\[ X = Geom(p) \\ p = \textrm{probability of success} \\ P(x) = p (1-p)^{x-1} \textrm{ ; } x\geq 1 \\ E[X] = \frac{1}{p} \\ V[X] = \frac{1-p}{p^2} \\ Desv[X] = \sqrt{\frac{1-p}{p^2}} \\ \]
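R's geometric functions (`rgeom`, `dgeom`, `pgeom`) count the failures before the first success, with support 0, 1, 2, ..., whereas the definition above counts trials, with support 1, 2, 3, .... The sketch below illustrates the shift of one between the two conventions, using an illustrative success probability of 0.475; the names `failures`, `trials` and `x` are chosen for this example.

```r
set.seed(139)                 # seed used elsewhere in this report
p <- 0.475                    # illustrative success probability (assumed)

failures <- rgeom(10000, p)   # failures before the first success: 0, 1, 2, ...
trials <- failures + 1        # trials up to and including the first success

# The simulated mean of `trials` should be close to 1/p
c(simulated = mean(trials), theoretical = 1 / p)

# P(X = x) for x >= 1 trials equals dgeom(x - 1, p)
x <- 1:10
all.equal(p * (1 - p)^(x - 1), dgeom(x - 1, p))
```

The variance is unaffected by the shift, so `var(failures)` and `var(trials)` both estimate (1-p)/p^2.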

A group of coders are preparing themselves to get a position at google. They already know that no bugs are permitted, at all. They can decide to code in Python, which google supports, or in Javi (a new Java implementation no one likes). What do you think will happen if the probability that you make a bug coding in Python is 0.525 and, if you are programming in Javi, the probability of writing good code is 0.316? You are coding different methods, programs and functions. Could you please solve the following questions if you need to write a good method or program in order to start interviews with google:

Variables

pPython <- 0.525 #probability of a bug when coding in Python
pJavi <- 0.316 #probability of good (no bug) code when coding in Javi

1.Compute the Expected value, the Variance and the Deviation of the number of attempts up to the 1st No Bug code (Python). Use at least two different methods

n <- 10000
mean(rgeom(n,1-pPython)+1) #rgeom counts the failures before the first success, so add 1 to count attempts

## [1] 2.1302

Ex <- 1/ (1-pPython)
Ex

## [1] 2.105263

Var <- pPython/(1-pPython)^2 #(1-p)/p^2 with success probability p = 1-pPython
Var

## [1] 2.32687

Dev <-  sqrt(Var)
Dev

## [1] 1.525408

2.Draw the pmf of the number of attempts up to and including the first No Bug code (Python)

pmf <- dgeom(0:15,1-pPython)
plot(pmf, type="l", xlab = "x", ylab = "P(x)", main = "PMF")

3.Show the cdf of the number of attempts up to and including the first Bug code (Javi)

cdf <- pgeom(0:20,1-pJavi) #a bug in Javi has probability 1-pJavi
plot(cdf, type="l", xlab = "x", ylab = "P(x)", main = "CDF")

4.Compute the Expected value, the Variance and the Deviation of the number of attempts up to the 1st Bug code (Python). Use at least two different methods

Ex = 1/pPython
Ex

## [1] 1.904762

Var = (1-pPython)/pPython^2
Var

## [1] 1.723356

Dev = sqrt(Var)
Dev

## [1] 1.312767

5.Simulate the 1st No Bug code (Python) using 10000 experiments and show a table with the estimated PMF and CDF. Compare with the theoretical results

set.seed(139)
nExperiments <- 10000 
data.table(PMF=dgeom(rgeom(nExperiments,pPython),1-pPython),CDF=pgeom(rgeom(nExperiments,pPython),1-pPython))

##               PMF       CDF
##     1: 0.47500000 0.8552969
##     2: 0.47500000 0.4750000
##     3: 0.13092188 0.4750000
##     4: 0.47500000 0.8552969
##     5: 0.47500000 0.7243750
##    ---                     
##  9996: 0.24937500 0.7243750
##  9997: 0.47500000 0.4750000
##  9998: 0.47500000 0.7243750
##  9999: 0.47500000 0.4750000
## 10000: 0.06873398 0.4750000
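One way to build the estimated PMF and CDF asked for here is to tabulate the relative frequency of each simulated value and accumulate it, then place the theoretical values alongside. The following is a sketch under that reading of the question; the names `pNoBug`, `attempts`, `pmfEst` and `comparison` are introduced for illustration, and a base `data.frame` is used so that it runs without extra packages.

```r
set.seed(139)
nExperiments <- 10000
pNoBug <- 1 - 0.525   # probability of bug-free Python code, from the statement

# Attempts up to and including the first bug-free code
attempts <- rgeom(nExperiments, pNoBug) + 1

# Relative frequency of each value from 1 to the largest one observed
x <- 1:max(attempts)
pmfEst <- tabulate(attempts, nbins = max(attempts)) / nExperiments

comparison <- data.frame(
  x      = x,
  pmfEst = pmfEst,                  # estimated pmf
  pmfTh  = dgeom(x - 1, pNoBug),    # theoretical pmf
  cdfEst = cumsum(pmfEst),          # estimated cdf
  cdfTh  = pgeom(x - 1, pNoBug)     # theoretical cdf
)
comparison
```

Each row then corresponds to one value of the random variable rather than to one simulated draw, which makes the estimated and theoretical columns directly comparable.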

6.Simulate the 1st No Bug code (Python) using 5000 experiments and show visually the estimated PMF and CDF. Compare with the theoretical results

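set.seed(139)
nExperiments <- 5000 
attempts <- rgeom(nExperiments,1-pPython)+1 #attempts up to and including the first No Bug code
x <- 1:max(attempts)
pmf <- tabulate(attempts,nbins=max(attempts))/nExperiments #estimated pmf
plot(x, pmf, type="h", xlab = "x", ylab = "P(x)", main = "PMF")
points(x, dgeom(x-1,1-pPython), col="red") #theoretical pmf

plot(x, cumsum(pmf), type="s", xlab = "x", ylab = "P(X<=x)", main = "CDF")
points(x, pgeom(x-1,1-pPython), col="red") #theoretical cdf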

7.Show the cdf of the number of attempts up to and including the first Bug code (Python)

cdf <- pgeom(0:15,pPython)
plot(cdf, type="l", xlab = "x", ylab = "P(x)", main = "CDF")

8.Simulate the 1st Bug code (Javi) using 2000 experiments and show visually the estimated PMF and CDF. Compare with the theoretical results

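set.seed(139)
nExperiments <- 2000 
attempts <- rgeom(nExperiments,1-pJavi)+1 #attempts up to and including the first Bug code
x <- 1:max(attempts)
pmf <- tabulate(attempts,nbins=max(attempts))/nExperiments #estimated pmf
plot(x, pmf, type="h", xlab = "x", ylab = "P(x)", main = "PMF")
points(x, dgeom(x-1,1-pJavi), col="red") #theoretical pmf

plot(x, cumsum(pmf), type="s", xlab = "x", ylab = "P(X<=x)", main = "CDF")
points(x, pgeom(x-1,1-pJavi), col="red") #theoretical cdf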

HyperGeometric Distribution

A Hypergeometric Distribution is defined as the number of successes in a sample of size n drawn without replacement from a population of N elements, k of which are successes.

\[ X = HyperGeom(N,k,n) \\ N = \textrm{Total number of elements} \\ k = \textrm{successful elements} \\ n = \textrm{sample size} \\ \\ P(x) = \frac{ \binom{k}{x} \binom{N-k}{n-x} } {\binom{N}{n}} \\ \\ E[X] = \frac{nk}{N} \\ V[X] = \frac{k(N-k)n(N-n)}{N^2(N-1)} \\ Desv[X] = \sqrt{\frac{k(N-k)n(N-n)}{N^2(N-1)}} \\ \]
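As a quick numerical check of these formulas, the sketch below evaluates the pmf with `choose` and compares it with R's `dhyper`, whose arguments are the number of successes in the population, the number of failures in the population, and the sample size. The values N = 76, k = 39 and n = 5 are used for illustration; the names `pmfManual`, `ex` and `vx` are chosen for this example.

```r
# Illustrative parameters (assumed for this sketch)
N <- 76   # total number of elements
k <- 39   # successful elements
n <- 5    # sample size
x <- 0:n

# pmf from the closed-form expression
pmfManual <- choose(k, x) * choose(N - k, n - x) / choose(N, n)

# dhyper takes (x, successes, failures, sample size)
all.equal(pmfManual, dhyper(x, k, N - k, n))

# Mean and variance from the definition of expectation
ex <- sum(x * pmfManual)
vx <- sum((x - ex)^2 * pmfManual)
c(mean = ex, variance = vx, sd = sqrt(vx))
```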

A big company has two main departments: the Engineers Department and the Economist Department. Between both departments there are 76 employees who can be promoted to the Board of Directors (BOD). However, only 5 positions are available. 39 of the employees are Engineers.

Variables

total <- 76 
engineers <- 39 
economists <- total - engineers 
picks <- 5

1.What is the probability that we have between 4 and 23 Engineers ?

sum(dhyper(4:23, engineers, economists, picks))

## [1] 0.1958904

2.What is the probability that fewer than 5 are Economists ?

sum(dhyper(0:4, economists, engineers, picks))

## [1] 0.9764059

3.Plot the pmf of the number of Economist on the Board of Directors

pmf <- dhyper(0:5, economists, engineers, picks)
plot(0:5, pmf, type="h", xlab = "x", ylab = "P(x)", main = "PMF") #x axis starts at 0 economists

4.Compute the Expected value, the Variance and the Deviation of the number of Engineers on BOD. Use at least two different methods

set.seed(139)
Ex <- (picks*engineers)/total
Ex

## [1] 2.565789

Ex2 <- mean(rhyper(total,engineers,economists,picks))
Ex2

## [1] 2.565789

Var <- (engineers*(total-engineers)*picks*(total-picks))/(total^2*(total-1))
Var

## [1] 1.182514

Var2 <- var(rhyper(total,engineers,economists,picks))
Var2

## [1] 1.463158

Dev <- sqrt(Var)
Dev

## [1] 1.087435

Dev2 <- sqrt(Var2)
Dev2

## [1] 1.209611

5.Simulate the number of Engineers on BOD using 1e+05 experiments and show a table with the estimated PMF and CDF. Compare with the theoretical results

x<- rhyper(100000,engineers,economists,picks)
data.table(PMF=dhyper(x,engineers,economists,picks,log=FALSE),CDF=phyper(x,engineers,economists,picks,log=FALSE))

##               PMF       CDF
##      1: 0.3294521 0.8041096
##      2: 0.3294521 0.8041096
##      3: 0.3294521 0.8041096
##      4: 0.3116438 0.4746575
##      5: 0.3294521 0.8041096
##     ---                    
##  99996: 0.1394196 0.1630137
##  99997: 0.1394196 0.1630137
##  99998: 0.1394196 0.1630137
##  99999: 0.1394196 0.1630137
## 100000: 0.3116438 0.4746575

6.What is the probability that there is no Economist on the Board of Directors ?

dhyper(0,economists,engineers,picks)

## [1] 0.03116438

7.Draw the pmf of the number of Engineers on BOD

pmf <- dhyper(0:picks, engineers, economists, picks) #at most 5 engineers can be picked
plot(0:picks, pmf, type="h", xlab = "x", ylab = "P(x)", main = "PMF")

8.What is the probability that exactly 2 are Engineers ?

dhyper(2, engineers, economists, picks)

## [1] 0.3116438

Poisson Distribution

A Poisson Distribution is defined as the number of events that occur within a fixed period of time, when events happen independently at a constant average rate.

\[ X = Pois(\lambda) \\ \lambda = \textrm{frequency, average number of events} \\ \\ P(x) = e^{-\lambda}\frac{\lambda^x}{x!} \\ \\ E[X] = \lambda \\ V[X] = \lambda \\ Desv[X] = \sqrt{\lambda} \\ \]

App entertainment has uploaded a new App to Google Play and Apple Store. The expected number of paid downloads per minute is supposed to be 62

Variables

lambda <- 62

1. What is the probability of exactly 6 downloads per day

dpois(6, lambda*60*24)

## [1] 0

2. What is the probability that we have between 25 and 58 downloads per day

sum(dpois(25:58, lambda*60*24))

## [1] 0
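The zeros above are a consequence of the rate rescaling: a rate of 62 downloads per minute corresponds to 62 x 60 x 24 = 89280 downloads per day, so counts as small as 6 or 58 lie so far in the lower tail that their probabilities underflow to 0 in double precision. The sketch below makes this visible on the log scale and shows the same interval probability written as a difference of CDF values; the names `lambdaMin` and `lambdaDay` are introduced for illustration.

```r
lambdaMin <- 62                    # downloads per minute, from the statement
lambdaDay <- lambdaMin * 60 * 24   # 89280 downloads per day

# Log-probability of exactly 6 downloads in a day (hugely negative)
dpois(6, lambdaDay, log = TRUE)

# P(25 <= X <= 58) as a difference of two CDF values
ppois(58, lambdaDay) - ppois(24, lambdaDay)

# Range holding most of the probability mass: mean +/- 3 standard deviations
lambdaDay + c(-3, 3) * sqrt(lambdaDay)
```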

3. Compute the Expected value, the Variance and the Deviation of the number of downloads per minute. Use at least two different methods

Ex <- lambda
Ex

## [1] 62

Var <- lambda
Var

## [1] 62

Dev <- sqrt(lambda)
Dev

## [1] 7.874008

4. What is the probability that we have between 48 and 37 downloads per day

sum(dpois(37:48,lambda*60*24))

## [1] 0

5. What is the probability that we have between 17 and 50 downloads per minute

sum(dpois(17:50, lambda))

## [1] 0.06842627

6. Simulate the number of downloads per minute using 10000 experiments and show visually the estimated PMF and CDF. Compare with the theoretical results

p <- rpois(10000,lambda)
pmf <- dpois(p,lambda)
cdf <- ppois(p,lambda)
data.table(PMF=pmf,CDF=cdf)

##                PMF       CDF
##     1: 0.008433079 0.9722157
##     2: 0.049794530 0.5835022
##     3: 0.048238451 0.6317406
##     4: 0.042886917 0.2887104
##     5: 0.046012061 0.6777527
##    ---                      
##  9996: 0.049781576 0.4325123
##  9997: 0.050597668 0.5337076
##  9998: 0.050597668 0.5337076
##  9999: 0.023521314 0.1116751
## 10000: 0.027515500 0.1391906

hist(pmf)

hist(cdf)
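A visual comparison can also be made by plotting the relative frequencies of the simulated counts against the theoretical pmf, and the empirical cdf against `ppois`. The following is a sketch of that approach; the names `draws`, `x` and `pmfEst` are chosen for this example.

```r
set.seed(139)
lambda <- 62                       # downloads per minute, from the statement
draws <- rpois(10000, lambda)

# Relative frequency of every value in the observed range
x <- min(draws):max(draws)
pmfEst <- as.numeric(table(factor(draws, levels = x))) / length(draws)

# Estimated pmf as vertical bars, theoretical pmf as red points
plot(x, pmfEst, type = "h", xlab = "x", ylab = "P(x)",
     main = "PMF: simulated vs theoretical")
points(x, dpois(x, lambda), col = "red", pch = 16)

# Empirical cdf as a step function, theoretical cdf as red points
plot(ecdf(draws), xlab = "x", ylab = "F(x)",
     main = "CDF: simulated vs theoretical")
points(x, ppois(x, lambda), col = "red", pch = 16)
```

If the simulation is adequate, the red points should sit close to the tops of the bars and on the steps of the empirical cdf.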

7. What is the probability of more than 34 downloads per day

1-ppois(34, lambda*24*60)

## [1] 1

8. Plot the pmf of the number of downloads per day

pmf <- dpois(1:100000,lambda*60*24)
plot(1:100000,pmf,main = "pmf of number of downloads per day", xlab = "downloads",ylab ="P(x)")

Joint Distribution

The matrix of joint probabilities

jointd <- matrix(c(.12 , .10, .03 , .10 , .09, .06 ,.13, .09, .05, .11, .08, .05), ncol=4, byrow = FALSE)
colnames(jointd) <- c("A", "B", "C", "D") #assign the names to the matrix itself
rownames(jointd) <- c("a", "b", "c")
jointd <- as.table(jointd)
jointd

##      A    B    C    D
## a 0.12 0.10 0.13 0.11
## b 0.10 0.09 0.09 0.08
## c 0.03 0.06 0.05 0.05

1. The marginal distribution of X1

mtable1 <- margin.table(jointd,1)
mtable1

##    a    b    c 
## 0.46 0.36 0.19

2. The marginal distribution of Y1

mtable2 <- margin.table(jointd,2)
mtable2

##    A    B    C    D 
## 0.25 0.25 0.27 0.24

3. The Expected value of X1

x1 <- 1*(0.12+0.1+0.13+0.11)
x1

## [1] 0.46

4. The Expected value of Y1

y1 <- 1*(0.12+0.1+0.03)
y1

## [1] 0.25

5. The Variance of X1

vx1 <- (1-x1)^2*(0.12+0.1+0.13+0.11)
vx1

## [1] 0.134136

6. The Variance of Y1

vy1 <- (1-y1)^2*(0.12+0.1+0.03) #uses the expected value of Y1, not of X1
vy1

## [1] 0.140625
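In general, the expected value and variance of a marginal are sums over all of its values, weighted by the marginal probabilities. The sketch below shows that computation under two explicit assumptions: the numeric values of X1 and Y1 are hypothetical (1 to 3 for the rows and 1 to 4 for the columns, since the table only carries labels), and the marginals are rescaled because the listed entries add up to 1.01. All names (`jointP`, `xVals`, `yVals`, `pX`, `pY`) are introduced for illustration.

```r
jointP <- matrix(c(.12, .10, .03, .10, .09, .06,
                   .13, .09, .05, .11, .08, .05), ncol = 4)

# Hypothetical numeric values taken by X1 (rows) and Y1 (columns)
xVals <- 1:3
yVals <- 1:4

pX <- rowSums(jointP)   # marginal of X1
pY <- colSums(jointP)   # marginal of Y1

# The listed entries sum to 1.01, so rescale before taking moments (assumption)
pX <- pX / sum(pX)
pY <- pY / sum(pY)

# E[X] = sum x P(x) and V[X] = sum (x - E[X])^2 P(x)
exX <- sum(xVals * pX)
varX <- sum((xVals - exX)^2 * pX)
exY <- sum(yVals * pY)
varY <- sum((yVals - exY)^2 * pY)
c(EX = exX, VarX = varX, EY = exY, VarY = varY)
```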

Plots

plot(mtable1, type="l", xlab = "x", ylab = "P(x)", main = "Marginal Distribution") #marginal distribution of X1

plot(mtable2, type="l", xlab = "x", ylab = "P(x)", main = "Marginal Distribution") #marginal distribution of Y1

plot(x1) #Expected value of X1

plot(y1) #Expected value of Y1

plot(vx1) #Variance of X1

plot(vy1) #Variance of Y1

Joint Distribution 2

NASA and the Jet Propulsion Laboratory work on a new testing software called XX. It can detect problems with the code written by some software engineers. Nevertheless they use an old testing approach to detect coding errors called YY. Note that the number to the right of the parenthesis is the value of the random variable that accounts for the number of errors detected. Can you try to solve the following questions?:

matriz <- matrix(c(.058,.060,.057,.064,.050,.006,.009,.006,.009,.006,.126,.130,.138,.140,.141),ncol=5,byrow = TRUE)
addmargins(matriz)

##                                     Sum
##     0.058 0.060 0.057 0.064 0.050 0.289
##     0.006 0.009 0.006 0.009 0.006 0.036
##     0.126 0.130 0.138 0.140 0.141 0.675
## Sum 0.190 0.199 0.201 0.213 0.197 1.000

1. What is the probability that the number of errors detected by YY is greater than or equal to 63, given that the number of errors detected by XX is less than or equal to 48

p <- 0
for(i in 2:3){
  for(j in 1:5){
    p <- p + matriz[i,j]
  }
}
p

## [1] 0.711

2. What is the probability that the number of errors detected by XX is greater or equal than 24

p <- 0
for(i in 1:3){
  p <- p + matriz[i,4] + matriz[i,5]
}
p

## [1] 0.41

3. What is the probability that the number of errors detected by XX is greater than or equal to 24, given that the number of errors detected by YY is greater than or equal to 55

#P(XX >= 24 | YY >= 55) = P(XX >= 24, YY >= 55) / P(YY >= 55)
#YY >= 55 corresponds to rows 2 and 3 (row 1 is YY < 55), XX >= 24 to columns 4 and 5
p <- sum(matriz[2:3,4:5]) / sum(matriz[2:3,])
p

## [1] 0.416315

4. What is the probability that the number of errors detected by XX is less or equal than 9

p <- 0
for(i in 1:3){
  for(j in 1:3){
    p <- p + matriz[i,j]
  }
}
p

## [1] 0.59

5. What is the probability that the number of errors detected by XX is less than or equal to 9, given that the number of errors detected by YY is greater than or equal to 63

#P(XX <= 9 | YY >= 63) = P(XX <= 9, YY >= 63) / P(YY >= 63)
#YY >= 63 corresponds to rows 2 and 3, XX <= 9 to columns 1 to 3
p <- sum(matriz[2:3,1:3]) / sum(matriz[2:3,])
p

## [1] 0.583685

6. What is the probability that the number of errors detected by YY is less than or equal to 65, given that the number of errors detected by XX is less than or equal to 9

#P(YY <= 65 | XX <= 9) = P(YY <= 65, XX <= 9) / P(XX <= 9)
#all three rows satisfy YY <= 65, XX <= 9 corresponds to columns 1 to 3
p <- sum(matriz[1:3,1:3]) / sum(matriz[,1:3])
p

## [1] 1

7. What is the probability that the number of errors detected by YY is less than 55

p <- 0
for(i in 1:5){
  p <- p + matriz[1,i]
}
p

## [1] 0.289

8. What is the probability that the number of errors detected by XX is greater than 24

p <- 0
for(i in 1:3){
  p <- p + matriz[i,5]
}
p

## [1] 0.197
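A conditional probability read from a joint table is the joint probability of both events divided by the marginal probability of the conditioning event, P(A | B) = P(A and B) / P(B). The helper below is a minimal sketch of that rule, assuming rows index the YY values and columns index the XX values; `condProb` and its arguments are hypothetical names, and the row and column selections in the example are illustrative because the table's value labels are not shown here.

```r
joint <- matrix(c(.058, .060, .057, .064, .050,
                  .006, .009, .006, .009, .006,
                  .126, .130, .138, .140, .141), ncol = 5, byrow = TRUE)

# P(A | B) for events described by sets of row and column indices.
# Rows index YY values and columns index XX values (assumed layout).
condProb <- function(joint, rowsA, colsA, rowsB, colsB) {
  rowsAB <- intersect(rowsA, rowsB)   # cells where both events hold
  colsAB <- intersect(colsA, colsB)
  sum(joint[rowsAB, colsAB]) / sum(joint[rowsB, colsB])
}

allRows <- 1:nrow(joint)
allCols <- 1:ncol(joint)

# Illustrative example: XX in the last two columns, given YY in the last two rows
pCond <- condProb(joint, rowsA = allRows, colsA = 4:5,
                  rowsB = 2:3, colsB = allCols)
pCond
```

When the conditioning event covers the whole table its probability is 1, and the conditional probability reduces to the plain marginal.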

Normal Distribution

The gaussian (or bell-shaped) distribution is the most important continuous distribution. Why? Normality arises naturally in many real contexts ranging from physical to biological, from engineering to social measurement situations. The central limit theorem (CLT) states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution. Normality is also important in statistical inference.

Where it first Came From

The normal curve was first developed mathematically in 1733 by Abraham de Moivre as an approximation to the binomial distribution. Laplace used the normal curve in 1783 to describe the distribution of errors.

Shape

Symmetric smooth form with a single mode that is also the location of the mean and median. Either side of the mode there is a point of inflection of the bell curve which is one unit (one standard deviation) from the mean. Beyond this point the curve extends towards the x-axis asymptotically, with a theoretical extent to infinity in both directions.

The normal distribution is described by two parameters: the mean \(\mu\) and the standard deviation \(\sigma\) and the Normal Model is written : \(X \sim Norm(\mu, \sigma)\)

\[ X \sim Norm(\mu,\sigma) \\ \mu = \textrm{ mean or expectation of the distribution} \\ \sigma = \textrm{ standard deviation} \\ f_X(x) = \frac{1}{\sigma \sqrt {2\pi }}exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \\ E[X] = \mu \\ V[X] = \sigma^2 \\ Desv[X] = \sigma \\ \]
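The density formula and the role of the two parameters can be checked numerically. The sketch below uses illustrative values of 48 for the mean and 4 for the standard deviation (the names `muEx`, `sigmaEx`, `x` and `pdfManual` are chosen for this example); it compares the closed-form density with `dnorm`, shows that probabilities depend only on the standardised value (x - mu)/sigma, and computes the mass within one, two and three standard deviations of the mean.

```r
muEx <- 48      # illustrative mean (assumed)
sigmaEx <- 4    # illustrative standard deviation (assumed)
x <- seq(muEx - 4 * sigmaEx, muEx + 4 * sigmaEx, length.out = 9)

# Density from the closed-form expression
pdfManual <- exp(-(x - muEx)^2 / (2 * sigmaEx^2)) / (sigmaEx * sqrt(2 * pi))
all.equal(pdfManual, dnorm(x, muEx, sigmaEx))

# Standardising: P(X <= x) equals the standard normal cdf at (x - mu) / sigma
all.equal(pnorm(x, muEx, sigmaEx), pnorm((x - muEx) / sigmaEx))

# Probability within 1, 2 and 3 standard deviations of the mean
within <- pnorm(1:3) - pnorm(-(1:3))
round(within, 4)
```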

Solve the following problems

Percentile Computation

Find out where the percentiles of this normal distribution are. Each of the coloured bars represents the same percentage.

Note that the mean of the distribution is \(\mu = 5251\) and the standard deviation is \(\sigma = 1760\).
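Percentile boundaries of this kind can be obtained with `qnorm`. The sketch below assumes, hypothetically, that the figure splits the distribution into 10 bands of equal probability (the actual number of coloured bars is not recoverable here); `nBands`, `probs` and `cuts` are names introduced for illustration.

```r
muP <- 5251       # mean, from the statement
sigmaP <- 1760    # standard deviation, from the statement
nBands <- 10      # hypothetical number of equal-probability bands

# Cumulative probabilities at the band boundaries: 0.1, 0.2, ..., 0.9
probs <- (1:(nBands - 1)) / nBands

# Values of the random variable at those percentiles
cuts <- qnorm(probs, mean = muP, sd = sigmaP)
round(cuts, 1)

# Check: every band between consecutive boundaries has the same probability
bandProb <- diff(pnorm(c(-Inf, cuts, Inf), muP, sigmaP))
bandProb
```

Changing `nBands` to the number of bars in the figure gives the corresponding boundaries.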

The marks of 1000 computing engineers (CE) are normally distributed with a mean equal to 48 and a standard deviation of 4. The marks of mechanical engineers (ME) also follow a normal distribution. The percentage of ME that get a mark under 0.0242 is 1.7 % and 3.9 % of them get a mark over 6.84

mu <- 48
sigma <- 4

1. We want to identify the students with the lowest marks. What is the mark that leaves 0.75 % of the students below it

qnorm(0.0075, mu, sigma)

## [1] 38.27048

2. What are the model parameters for the mechanical engineers (ME) students

#P(X < 0.0242) = 0.017 -> (0.0242-mu)/sigma = qnorm(0.017)
#P(X > 6.84) = 0.039 -> P(X < 6.84) = 0.961 -> (6.84-mu)/sigma = qnorm(0.961)
#The probabilities have to be converted to standard normal quantiles before solving for mu and sigma
#Subtracting the first equation from the second: (6.84-0.0242)/sigma = qnorm(0.961)-qnorm(0.017)
z1 <- qnorm(0.017)
z2 <- qnorm(1-0.039)
sigmaMe <- (6.84-0.0242)/(z2-z1)
muMe <- 0.0242 - z1*sigmaMe
round(c(muMe, sigmaMe), 1)

## [1] 3.7 1.8

3. Find the value of the RV. (x) if we know that the probability that RV is greater than (x) is 0.199

qnorm(0.199, mu, sigma, lower.tail = FALSE) #P(X > x) = 0.199 is an upper tail probability

## [1] 51.38079

4. Find the value of the RV. (x) if we know that the probability that RV is greater than (x) is 0.978

qnorm(0.978, mu, sigma, lower.tail = FALSE) #P(X > x) = 0.978 is an upper tail probability

## [1] 39.94364

5. What is the probability that the RV. is less than 52.8

pnorm(52.8, mu, sigma)

## [1] 0.8849303

6. We want to identify the students with the lowest marks. What is the mark that leaves 2 % of the students below it

qnorm(0.02, mu, sigma)

## [1] 39.785

7. Find the value of the RV. (x) if we know that the probability that RV is greater than (x) is 0.427

qnorm(0.427, mu, sigma, lower.tail = FALSE) #P(X > x) = 0.427 is an upper tail probability

## [1] 48.73607

8. What is the probability that the random variable lies between 48.6 and 50.7

pnorm(50.7, mu, sigma)-pnorm(48.6, mu, sigma)

## [1] 0.1905444