Topic 3: Probability and Distributions


These are the solutions for Computer Lab 4.


1 Using the norm R functions

Example R code is shown in the questions below, where relevant.

1.1 pnorm

No answer required.

1.1.1

We could verify this value using:

round(pnorm(1.5, mean = 0, sd = 1), 4)
## [1] 0.9332

1.1.2

No answer required.

1.1.3

pnorm(2, mean = 0, sd = 1)
## [1] 0.9772499
pnorm(-1, mean = 0, sd = 1)
## [1] 0.1586553
pnorm(1, mean = 2, sd = 1)
## [1] 0.1586553

Notice that the result is the same as for part b. We have shifted the curve two units to the right (so it is now centred at 2 rather than 0) without changing its spread, so the value 1 again lies one standard deviation below the mean.
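
The same equality can be checked by standardising by hand. The following is a minimal sketch using base R only; the variable names (x, mu, sigma, z) are illustrative and not part of the lab.

```r
# Standardise x = 1 under N(mean = 2, sd = 1) and compare with the
# standard normal probability from part b.
x <- 1
mu <- 2
sigma <- 1
z <- (x - mu) / sigma            # z-score, here -1

p_shifted  <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z, mean = 0, sd = 1)

all.equal(p_shifted, p_standard) # TRUE
```

Any normal probability can be converted to a standard normal one this way, which is why the two calls agree.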

pnorm(1, mean = 0, sd = sqrt(2))
## [1] 0.7602499
pnorm(2, mean = 0, sd = sqrt(3)) - pnorm(1, mean = 0, sd = sqrt(3))
## [1] 0.1577449
pnorm(1, mean = 0, sd = 2) - pnorm(-1, mean = 0, sd = 2)
## [1] 0.3829249

Note here that we could use the alternative approach below, which makes use of the symmetry property.

1 - 2 * pnorm(-1, mean = 0, sd = 2)
## [1] 0.3829249
1 - pnorm(2, mean = 0, sd = sqrt(5))
## [1] 0.1855467

Note here that we are using the complement rule.

1 - pnorm(-1, mean = 0, sd = 3)
## [1] 0.6305587

Note here that we are using the complement rule.
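
As an aside, pnorm also has a lower.tail argument, so the complement does not have to be taken by hand. A short sketch comparing the two routes (the variable names are illustrative):

```r
# Upper-tail probability P(X > -1) for X ~ N(mean = 0, sd = 3), two ways.
p_complement <- 1 - pnorm(-1, mean = 0, sd = 3)
p_upper      <- pnorm(-1, mean = 0, sd = 3, lower.tail = FALSE)

all.equal(p_complement, p_upper) # TRUE
```

Either form is fine for this lab; the complement rule makes the reasoning more explicit.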

pnorm(-2, mean = 0, sd = 1) + (1 - pnorm(2, mean = 0, sd = 1))
## [1] 0.04550026

Note here that we could use the alternative approach below, which makes use of the symmetry property.

2 * pnorm(-2, mean = 0, sd = 1)
## [1] 0.04550026

1.2 qnorm

No answer required.

1.2.1

round(qnorm(0.9772499, mean = 0, sd = 1), 2)
## [1] 2
qnorm(0.5, mean = 0, sd = 1)
## [1] 0
round(qnorm(0.7733726, mean = 0, sd = 1), 2)
## [1] 0.75
round(qnorm(0.2742531, mean = 1, sd = 1), 2)
## [1] 0.4
round(qnorm(0.7421539, mean = 3, sd = sqrt(2)), 2)
## [1] 3.92
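
qnorm is the inverse of pnorm, which is why each answer above recovers a tidy value once rounded. The sketch below illustrates the round trip; the variable names are illustrative only.

```r
# qnorm() undoes pnorm(): start from a value, convert it to a
# probability, then recover the value.
z <- 0.75
p <- pnorm(z, mean = 0, sd = 1)
z_back <- qnorm(p, mean = 0, sd = 1)
all.equal(z, z_back)             # TRUE

# The same holds for a non-standard normal, here N(mean = 3, sd = sqrt(2)).
x <- 3.92
x_back <- qnorm(pnorm(x, mean = 3, sd = sqrt(2)), mean = 3, sd = sqrt(2))
all.equal(x, x_back)             # TRUE
```

The probabilities in the answers above are quoted to seven decimal places, so the quantiles come out very close to, but not exactly at, the tidy values; hence the calls to round.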

1.3 rnorm (if you have time)

set.seed(2)
y <- rnorm(10000, mean = 0, sd = 1) 

1.3.1

hist(y, xlab = "x value", main = "Example Plot", col = "skyblue", freq = FALSE)
curve(dnorm(x, mean = 0, sd = 1), 
      col = "orange", yaxt = "n", lwd = 3, add = TRUE)
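
As a rough numerical companion to the plot, the simulated values can be compared with the theoretical distribution directly. This is only a sketch: it regenerates y with the same seed, and the names emp and theo are illustrative.

```r
set.seed(2)                      # same seed as above
y <- rnorm(10000, mean = 0, sd = 1)

mean(y)                          # should be close to 0
sd(y)                            # should be close to 1

# Proportion of simulated values at or below 1.5, versus P(Z <= 1.5).
emp  <- mean(y <= 1.5)
theo <- pnorm(1.5, mean = 0, sd = 1)
c(empirical = emp, theoretical = theo)
```

With 10000 draws the two numbers should agree to about two decimal places, which mirrors how closely the histogram follows the orange curve.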

1.4 dnorm (if you have time)

xval <- 1
dnorm(xval, mean = 0, sd = 1)
## [1] 0.2419707

To compute the density at \(x=-1\), we can use the code

dnorm(-1, mean = 0, sd = 1)
## [1] 0.2419707

1.4.1

The density at \(x=-1.5\) would be the same as the density at \(x=1.5\), i.e., \(0.1295\).
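
The symmetry is easy to see from the standard normal density formula, in which x only appears squared. A small sketch, where std_normal_density is a hypothetical helper written for illustration rather than a built-in function:

```r
# Standard normal density written out by hand.
std_normal_density <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)

d_pos <- std_normal_density(1.5)
d_neg <- std_normal_density(-1.5)

all.equal(d_pos, dnorm(1.5, mean = 0, sd = 1))  # TRUE
all.equal(d_pos, d_neg)                          # TRUE, since x only enters as x^2
round(d_pos, 4)
```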

2 Using the binom R functions

2.1 Playing Cards Example

No answer required.

2.2 The Binomial Distribution

No answer required.

2.3 Overview of binom R functions

No answer required.

2.4 dbinom

The probability of guessing correctly exactly once out of ten guesses is

dbinom(1, 10, 0.25)
## [1] 0.1877117
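
For reference, this value can be reproduced from the binomial probability formula, choose(n, k) * p^k * (1 - p)^(n - k). The sketch below uses illustrative variable names.

```r
# Probability of exactly k = 1 success in n = 10 trials with p = 0.25.
n <- 10
k <- 1
p <- 0.25

by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(by_hand, dbinom(k, size = n, prob = p)) # TRUE
```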

2.4.1

We have:

The probability of making zero correct guesses out of ten guesses is

dbinom(0, 10, 0.25)
## [1] 0.05631351

The probability of guessing correctly exactly twice out of ten guesses is

dbinom(2, 10, 0.25)
## [1] 0.2815676

The probability of guessing correctly exactly three times out of ten guesses is

dbinom(3, 10, 0.25)
## [1] 0.2502823

The probability of guessing correctly exactly nine times out of ten guesses is

dbinom(9, 10, 0.25)
## [1] 2.861023e-05

The probability of guessing correctly exactly ten times out of ten guesses is

dbinom(10, 10, 0.25)
## [1] 9.536743e-07

We notice that (as expected) the probability of a high number of correct guesses, such as nine or ten, is very small.
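
Because dbinom accepts a vector, the whole distribution can be tabulated in one call rather than one value at a time. A minimal sketch (the name probs is illustrative):

```r
# Probabilities of 0, 1, ..., 10 correct guesses out of ten.
probs <- dbinom(0:10, size = 10, prob = 0.25)
names(probs) <- 0:10

round(probs, 4)
sum(probs)                       # the probabilities add up to 1
names(probs)[which.max(probs)]   # most likely number of correct guesses
```

This makes it easy to see where the probabilities peak and how quickly they fall away as the number of correct guesses increases.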

2.5 pbinom

We have:

pbinom(6, 10, 0.25)
## [1] 0.9964943
pbinom(3, 10, 0.25)
## [1] 0.7758751
1 - pbinom(3, 10, 0.25)
## [1] 0.2241249
1 - pbinom(8, 10, 0.25)
## [1] 2.95639e-05
pbinom(8, 10, 0.25) - pbinom(6, 10, 0.25)
## [1] 0.003476143

Note that for e, since the Binomial distribution is a discrete distribution, we could also have used dbinom here, i.e. dbinom(7, 10, 0.25) + dbinom(8, 10, 0.25).
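
The sum in the note above can also be written with a vector of values and sum. A short sketch comparing both routes (the variable names are illustrative):

```r
# P(7 <= X <= 8) for X ~ Binomial(10, 0.25), two ways.
via_cdf <- pbinom(8, 10, 0.25) - pbinom(6, 10, 0.25)
via_pmf <- sum(dbinom(7:8, 10, 0.25))

all.equal(via_cdf, via_pmf)      # TRUE
```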

pbinom(6, 10, 0.25)
## [1] 0.9964943
pbinom(5, 10, 0.25)
## [1] 0.9802723

You could use either

1 - pbinom(8, 10, 0.25)
## [1] 2.95639e-05

or

dbinom(9, 10, 0.25) + dbinom(10, 10, 0.25)
## [1] 2.95639e-05

Again, you could use either

1 - pbinom(7, 10, 0.25)
## [1] 0.000415802

or

dbinom(8, 10, 0.25) + dbinom(9, 10, 0.25) + dbinom(10, 10, 0.25)
## [1] 0.000415802

You could use either

pbinom(7, 10, 0.25) - pbinom(4, 10, 0.25)
## [1] 0.07771111

or

dbinom(5, 10, 0.25) + dbinom(6, 10, 0.25) + dbinom(7, 10, 0.25)
## [1] 0.07771111
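
A common slip with discrete distributions is the lower bound: pbinom(4, ...) is subtracted above because pbinom(5, ...) would also remove the probability of exactly 5. The sketch below wraps this in a small function; binom_between is a hypothetical helper for illustration, not a built-in R function.

```r
# Hypothetical helper: P(a <= X <= b) for X ~ Binomial(size, prob).
# The lower bound uses a - 1 because pbinom(a, ...) already includes X = a.
binom_between <- function(a, b, size, prob) {
  pbinom(b, size, prob) - pbinom(a - 1, size, prob)
}

p_5_to_7 <- binom_between(5, 7, size = 10, prob = 0.25)
p_5_to_7
sum(dbinom(5:7, size = 10, prob = 0.25))      # same value

# P(X >= 8) is the special case b = size.
p_8_plus <- binom_between(8, 10, size = 10, prob = 0.25)
all.equal(p_8_plus, 1 - pbinom(7, 10, 0.25))  # TRUE
```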

2.5.1

It is highly unlikely (but not impossible) that they are telling the truth. Using our results from 2.5, the probability of making more than 6 correct guesses out of 10 is 1 - 0.9964943 = 0.0035057, which is extremely small. This probability is almost equal to the probability of making exactly 7 or 8 correct guesses (0.0034761), because 9 or 10 correct guesses are rarer still. To achieve this twice is even less likely. In conclusion, the student is probably lying.
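
To put a number on "twice", one possible sketch is shown below. It assumes that the two attempts are separate sets of ten guesses and that they are independent, so the probabilities multiply; both are assumptions made here for illustration and are not stated in the lab. The variable names are also illustrative.

```r
# P(more than 6 correct out of 10) in a single attempt.
p_once <- 1 - pbinom(6, 10, 0.25)

# Assumption: two independent attempts, so the probabilities multiply.
p_twice <- p_once^2

p_once
p_twice
```

Under these assumptions the chance of doing this twice is on the order of 1 in 100000.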

2.6 rbinom (if you have time)

To generate data, we use the rbinom R function, with n = 20 (the number of students), size = 10 (the number of guesses each student makes) and prob = 0.25, i.e.:

y <- rbinom(n = 20, size = 10, prob = 0.25)
y # look at results
##  [1] 4 4 1 2 5 4 0 2 4 0 2 6 2 1 1 3 3 2 2 2

The 20 numbers represent the numbers of correct guesses made by each student.

2.6.1

Note that for this question, these answers may not match yours, since we are dealing with randomly generated values. The process however remains the same.

Note that these summary results will not make sense if you have mixed up the n and size arguments.

summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.75    2.00    2.50    4.00    6.00

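Here we can see that the average number of correct guesses was 2.5, and the maximum number of correct guesses was 6. These values aren’t too surprising: the expected number of correct guesses is 10 × 0.25 = 2.5, and from 2.5 the probability of making at most 6 correct guesses is 0.9964943.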
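
To compare a simulated sample with what the Binomial distribution predicts, the theoretical mean and standard deviation can be computed alongside it. This is a sketch only: the seed is arbitrary (chosen here just so the run is repeatable) and the variable names are illustrative, so the sample will differ from the one shown above.

```r
n_guesses <- 10
p <- 0.25

expected_mean <- n_guesses * p                 # 2.5
expected_sd   <- sqrt(n_guesses * p * (1 - p)) # about 1.37

set.seed(1)   # arbitrary seed, only for repeatability
y <- rbinom(n = 20, size = n_guesses, prob = p)

c(sample_mean = mean(y), expected_mean = expected_mean)
c(sample_sd = sd(y), expected_sd = expected_sd)
```

With only 20 students the sample values will wander around the theoretical ones; a larger n brings them closer.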

3 Overlaying a Normal curve on a Histogram

Example R code for the analysis of the Portuguese student data (UCI Machine Learning Repository 2014) is shown in the questions below, where relevant.

3.1

No answer required.

3.2

No answer required.

3.3

data <- read.csv("student-mat.csv", sep = ";")

3.4

No answer required.

3.5

results <- data[, c("absences", "G1", "G2", "G3")]

3.6

results$G1[results$G1 == 0] <- NA # Set any 0s in G1 to NA
results$G2[results$G2 == 0] <- NA # Set any 0s in G2 to NA
results$G3[results$G3 == 0] <- NA # Set any 0s in G3 to NA
results <- na.omit(results) # Remove rows containing NAs (the former 0s)
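
An alternative way to get the same rows is to filter on the three grade columns directly. The sketch below uses a small made-up data frame (toy) in place of the real data so that it runs on its own; the values are invented for illustration.

```r
# Toy data frame standing in for `results` (values are made up).
toy <- data.frame(
  absences = c(0, 4, 10, 2, 6),
  G1 = c(5, 0, 7, 15, 12),
  G2 = c(6, 5, 8, 14, 0),
  G3 = c(6, 6, 0, 15, 11)
)

# Approach used above: zeros in the grade columns -> NA, then drop those rows.
a <- toy
for (g in c("G1", "G2", "G3")) a[[g]][a[[g]] == 0] <- NA
a <- na.omit(a)

# Alternative: keep only rows where all three grades are non-zero.
b <- toy[toy$G1 != 0 & toy$G2 != 0 & toy$G3 != 0, ]

nrow(a)
nrow(b)
```

Note that a zero in absences is left alone in both versions. The two approaches agree only when the grade columns contain no missing values to begin with, since na.omit would also drop rows that were already NA.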

3.7

hist(x = results$G1, main = "First Period Results", xlab = "Score")

3.8

hist(x = results$G1, freq = FALSE, 
       main = "First Period Results", xlab = "Score", ylim = c(0, 0.14))

curve(dnorm(x, mean = mean(results$G1), sd = sd(results$G1)), 
      col="green", yaxt="n", lwd=2, add=TRUE)

3.9

par(mfrow = c(2,2), cex = 0.8, mex = 0.8)
hist(x = results$G1, freq = FALSE, 
       main = "First Period Results", xlab = "Score", ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(results$G1), sd = sd(results$G1)), 
      col="green", yaxt="n", lwd=2, add=TRUE)
      
hist(x = results$G2, freq = FALSE, 
       main = "Second Period Results", xlab = "Score", ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(results$G2), sd = sd(results$G2)), 
      col="green", yaxt="n", lwd=2, add=TRUE)
      
hist(x = results$G3, freq = FALSE, 
       main = "Final Grade", xlab = "Score", ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(results$G3), sd = sd(results$G3)), 
      col="green", yaxt="n", lwd=2, add=TRUE)
      
hist(x = results$absences, freq = FALSE, 
       main = "Absences", xlab = "Number", ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(results$absences), sd = sd(results$absences)), 
      col="green", yaxt="n", lwd=2, add=TRUE)
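
Since the four panels repeat the same two calls, they could be wrapped in a small function. The sketch below is one way to do that; hist_with_normal is a hypothetical helper, and the data frame fake is simulated so the example runs without the CSV file.

```r
# Hypothetical helper: density-scaled histogram with a fitted normal curve.
hist_with_normal <- function(values, title, xlab = "Score", ylim = c(0, 0.14)) {
  m <- mean(values)
  s <- sd(values)
  hist(x = values, freq = FALSE, main = title, xlab = xlab, ylim = ylim)
  curve(dnorm(x, mean = m, sd = s), col = "green", lwd = 2, add = TRUE)
  invisible(c(mean = m, sd = s))   # return the fitted parameters quietly
}

# Simulated stand-in for the real columns (made-up data).
set.seed(1)
fake <- data.frame(G1 = round(rnorm(200, mean = 11, sd = 3)),
                   G2 = round(rnorm(200, mean = 11, sd = 3)))

old_par <- par(mfrow = c(1, 2), cex = 0.8, mex = 0.8)
fit_g1 <- hist_with_normal(fake$G1, "First Period Results")
fit_g2 <- hist_with_normal(fake$G2, "Second Period Results")
par(old_par)   # restore the previous plot layout
```

Saving and restoring par means later plots are not affected by the multi-panel layout.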

3.10

This is somewhat open to interpretation. One could say that the First Period Results and the Final Grade histograms look roughly normally distributed, as the bins appear to roughly follow the normal distribution bell curve. However, one could also observe that these two histograms are both slightly positively skewed (and hence do not look normally distributed). The histogram closest to being normally distributed is the first one (the Final Grade histogram has a heavier than expected left tail). The absences histogram is clearly not normally distributed, and the Second Period Results histogram is probably not normally distributed, although if you changed the number of bins, this observation could also change.
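
To see how the number of bins can change the impression a histogram gives, the breaks argument of hist can be varied. The sketch below uses simulated scores (made-up data, not the student data set); note that a single number passed to breaks is treated by R as a suggestion only.

```r
set.seed(1)
scores <- round(rnorm(300, mean = 11, sd = 3))   # made-up scores

# plot = FALSE returns the bin information without drawing anything.
h_few  <- hist(scores, breaks = 5,  plot = FALSE)
h_many <- hist(scores, breaks = 20, plot = FALSE)

length(h_few$counts)    # number of bins actually used with breaks = 5
length(h_many$counts)   # number of bins actually used with breaks = 20
```

Dropping plot = FALSE draws the two histograms, which can look quite different even though the underlying data are identical.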


That’s everything for now! If there were any parts you were unsure about, take a look back over the relevant sections of the Topic 3 material.


References

UCI Machine Learning Repository. 2014. “Student Performance Data Set [.csv File].” https://archive.ics.uci.edu/ml/datasets/student+performance.


These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.