Topic 3: Probability and Distributions

In Topic 3 we introduced the concept of probability. In this computer lab, we will cover how to carry out probability calculations in R, focusing on important continuous and discrete random variables and probability distributions.

By the end of this lab, you should feel confident in using and distinguishing between R functions for computing the following Normal and Binomial distribution aspects:

distribution function
quantile function
random deviates
density

In Lecture 4 (Part 2), we covered material that will be very helpful for today’s computer lab. If you have not already watched Lecture 4 (Part 2), you may wish to do so now.

After working through the questions in this computer lab, you will be ready to complete Quiz 4. If you have time during today’s lab, you may like to work on the quiz.

1 Using the `norm` R functions

🏡 In this question, we will learn how to use the various norm functions in R, which are related to the Normal Distribution (hence norm).

Firstly, recall that we express the distribution of a normally distributed random variable $X$ as \[X \sim N(\mu, \sigma^2).\] Notice that we use the $\color{blue}{\text{variance}, \sigma^2}$, when specifying the $\color{blue}{\text{distribution}}$.

$\color{crimson}{\text{Be careful:}}$ The various norm R functions require $\color{crimson}{\text{$\sigma$, i.e. the standard deviation (sd)}}$, to be specified as the spread information, not the $\color{blue}{\text{variance}}$.

As we can see below, there are 4 versions of the norm function. The additional letter preceding norm specifies the aspect of the Normal distribution in which we are interested.

Hence we have:

$\color{blue}{\text{Note:}}$ If we do not specify mean and sd, R will assume we want to calculate values from the Standard Normal Distribution. That is, R will assume we have mean = 0 and sd = 1.

To help us further understand how these functions work, we will now consider some examples. A walkthrough video for sections 1.1 - 1.1.2 is included below. Give these questions a go, and refer to the video if you find yourself stuck at any stage.

1.1 `pnorm`

🏡 Suppose we would like to know $P(X \leq 1)$, the probability that the random variable $X$ has an observed value less than or equal to 1 (where $X \sim N(0,1)$).

To calculate this, we can use the pnorm function as follows:

pnorm(1, mean = 0, sd = 1)

This gives us a value of 0.84. This probability is equivalent to computing the area under the standard Normal curve to the left of $x = 1$. This area is shaded in blue in the plot below:

1.1.1

🏡 To help solidify our understanding of the pnorm function, let’s consider the probability that the random variable $X$ has an observed value less than or equal to a different value, i.e. 1.5. $P(X \leq 1.5)$ corresponds to the shaded blue area in the plot below:

If we compute this probability in R, we obtain the value 0.9332. See if you can use the pnorm function in R to verify this result.

1.1.2

🏡 If we compare the probabilities $P(X \leq 1)$ and $P(X \leq 1.5)$, and check the plots shown in 1.1 and 1.1.1, we can see that by increasing our observed $x$ value from 1 to 1.5, we have covered an additional area of roughly 0.092 under the curve. This additional area is shaded red in the plot below, and represents the probability that X is between 1 and 1.5. That is, that $P(1 \leq X \leq 1.5)$:

To compute this value in R, we can use pnorm in the following manner:

pnorm(1.5, mean = 0, sd = 1) - pnorm(1, mean = 0, sd = 1)

Note that here, the first use of pnorm is equivalent to calculating the blue shaded area in 1.1.1, while the second use of pnorm is equivalent to calculating the blue shaded area in 1.1. By subtracting the second area from the first (i.e. pnorm(1.5, mean = 0, sd = 1) - pnorm(1, mean = 0, sd = 1)), we end up with just the red shaded area shown in 1.1.2, i.e. $P(1 \leq X \leq 1.5)$, as desired.

1.1.3

💻 As you can see, not only can we use pnorm to compute the probability that the random variable $X$ has an observed value less than or equal to some value (e.g. $x=1$), but we can also use pnorm to compute the probability that the random variable $X$ takes an observed value between two arbitrary values (e.g. between $x=1$ and $x=1.5$).

To conclude this section, use the pnorm function to determine the following probabilities. For each of these calculations, try and sketch (by hand, not using R!) a picture of the standard Normal distribution, and then fill in the relevant area under the curve.

A walkthrough video for questions a-e is included below. Give these questions a go, and refer to the video if you find yourself stuck at any stage.

🎧 Online students

💬 For each sub-question selected by the facilitator, enter your answer next to the question in the shared Google Doc.

$P(X \leq 2)$ for $X \sim N(0, 1)$.
$P(X \leq -1)$ for $X \sim N(0, 1)$.
$P(X \leq 1)$ for $X \sim N(2, 1)$.
$P(X \leq 1)$ for $X \sim N(0, 2)$. Hint: always remember to specify the standard deviation in R rather than the variance.
$P(1 \leq X \leq 2)$ for $X \sim N(0, 3)$.
$P(-1 \leq X \leq 1)$ for $X \sim N(0, 2^2)$.
$P(X \geq 2)$ for $X \sim N(0, 5)$. $^\dagger$
$P(X \geq -1)$ for $X \sim N(0, 3^2)$. $^\dagger$
$P(X \leq -2) + P(X \geq 2)$ for $X \sim N(0, 1)$.

$^\dagger$Hint: For g and h, think about what the total area under a valid probability density curve must equal.

1.2 `qnorm`

💻 Note that in 1.1, our distribution function value was $P(X \leq 1) = 0.8413$.

Suppose that we knew $X \sim N(0,1)$, and were given the probability value 0.8413447, but did not know the value of $x$ to which it corresponded (i.e. the quantile value). In this instance, we could calculate this value of $x$ using the R function qnorm as follows:

    qnorm(0.8413447, mean = 0, sd = 1)

## [1] 0.9999998

As you can see, this gives us our quantile value of $x = 1$ (for a reasonable number of decimal places of accuracy). To help visualise this, take a look at the plot below:

Note here that the area under the curve to the left of the quantile $x=1$ (shown with the dashed red line) is equal to $P(X \leq 1) = 0.8413447$ (shown in blue).

1.2.1

💻 To conclude this section, use the qnorm function to determine the following quantiles. Remember, for each of these calculations, it might be helpful to sketch (by hand, not using R!) a picture of the standard Normal distribution, and then fill in the relevant area under the curve (roughly) to determine the position of the quantile value.

Make sure to round your answers to 2 decimal places.

🎧 Online students

💬 For each sub-question selected by the facilitator, enter your answer next to the question in the shared Google Doc.

Find $x$ if $P(X \leq x) = 0.9772499$ for $X \sim N(0,1)$.
Find $x$ if $P(X \leq x) = 0.5$ for $X \sim N(0,1)$.
Find $x$ if $P(X \leq x) = 0.7733726$ for $X \sim N(0,1)$.
Find $x$ if $P(X \leq x) = 0.2742531$ for $X \sim N(1,1)$.
Find $x$ if $P(X \leq x) = 0.7421539$ for $X \sim N(3,2)$.

1.3 `rnorm` (if you have time)

💻 Next, we use the rnorm function to randomly generate 10000 values from a Normal distribution with mean = 0 and sd = 1. Note that since we are using these parameter values, the values will actually be generated from the standard Normal distribution.

Check the code below, and make sure you understand the process.

set.seed(2)
y <- rnorm(10000, mean = 0, sd = 1)

Note that we run the command set.seed(2) here to ensure that every time we generate this document (and in doing so generate the random sample y), the same random number generation process is applied, so our results will not change - if we do not specify a seed, every time we generate y our results will be different.

1.3.1

💻 The following code plots a histogram of our randomly generated data, and overlays a standard normal curve. Run the below code to create the plot. We will have more practice in Question 3.9.

hist(y, xlab = "x value", main = "Example Plot", col = "skyblue", freq = FALSE)
curve(dnorm(x, mean = 0, sd = 1), 
      col = "orange", yaxt = "n", lwd = 3, add = TRUE)

Note that the histogram won’t look perfectly bell-shaped, since we are considering a random sample of values.

1.4 `dnorm` (if you have time)

💻 Suppose we want to obtain the density of this curve, at the value $x=1$. To do this, we specify our value of 1 and use the dnorm function as follows:

dnorm(1, mean = 0, sd = 1)

This gives us a value of 0.242. Let’s visualise this point on our density curve.

Note that since this is a Normal distribution (and therefore it is symmetrical), the horizontal red line will also intersect the curve at the point ($x = -1, y = 0.2420$). See if you can adapt the code provided above to verify this for yourself in R.

2 Using the `binom` R functions

🏡 Hopefully, working through 1 has made the differences and usages of the different norm functions much clearer.

As you may have guessed, these density, distribution, quantile and random number generating functions are not restricted to the Normal distribution.

Let us now consider applying what we have learnt in 1 to the Binomial Distribution, which is a discrete distribution. In discrete distributions, we have a finite number of possible outcomes, and each of these outcomes has a probability of occurring, with the sum of these probabilities equal to 1.

2.1 Playing Cards Example

🏡 Recall the playing cards example introduced in the Topic 3 Lecture. In this example, we guessed the suit of a playing card (hearts, diamonds, spades or clubs) 10 times, with replacement, from a standard deck of playing cards.

Since there are only 4 suits, and each suit has the same number of cards, the probability of a correct guess was 25%, or 0.25.

Your number of correct guesses, which we will refer to as $X$, had a range from 0 up to 10. However, given the probability of a correct guess was 25%, it is highly unlikely that you will have made a large number of correct guesses (if you did, well done!).

We can actually quantify the probability associated with making a certain number of correct guesses, using the Binomial distribution.

Let’s take a look at how the Binomial distribution works.

2.2 The Binomial Distribution

🏡 Suppose we have $n$ “trials”, each with an outcome of either “success” or “failure”. Further suppose that for each trial, the probability of “success” is equal to $p$, and that $X$ is the number of “successes” from the $n$ trials. We can model $X$ using the Binomial Distribution. In mathematical notation, we define the distribution as \[ X \sim BIN(n, p),\] where:

$X$ is the number of successes
$n$ is the number of trials
$p$ is the probability of success for each trial.

If, for example, we are interested in the probability of some number $x$ successes (out of the $n$ trials), we can write this probability down as $P(X = x)$.

Given in our playing cards example we have $n = 10$ trials with a probability of “success” of $p = 0.25$ for each trial, we define the distribution in this example as: \[X \sim BIN(10, 0.25).\]

2.3 Overview of `binom` R functions

🏡 The below table provides an overview of the binom functions we will be using in this question:

Notes:

In R, the first argument in the pbinom function is called q. However in the above table, we have referred to it as x for ease of notation.
Although not required for this computer lab, the qbinom and rbinom functions are also available in R. If you would like to learn more about these functions, you can find more information in the help file, which can be opened by running ?qbinom.

2.4 `dbinom`

💻 Let’s try computing $P(X=x)$ in R, for a few different values of $x$. To do so, we will be using the dbinom function.

Recalling that in our playing cards example we have $n = 10$ trials, each with a probability of success of $p = 0.25$, we can compute $P(X=1)$ by running following code:

dbinom(1, 10, 0.25)

Make sure you understand the arguments here before proceeding. Using the above code, if you guessed the suit 10 times, what is the probability of guessing correctly exactly once?

A walkthrough video for this question is included below. Please refer to the video if you find yourself stuck at any stage with the dbinom coding process. If you feel comfortable, proceed to section 2.4.1.

2.4.1

💻 Using the details in 2.4 as a guide, compute the following probabilities for $X \sim BIN(10, 0.25)$:

$P(X=0)$
$P(X=2)$
$P(X=3)$
$P(X=9)$
$P(X=10)$

What do you notice about these results?

🎧 Online students

💬 For each sub-question selected by the facilitator, enter your answer next to the question in the shared Google Doc.

2.5 `pbinom`

💻 We can use the pbinom function to compute $P(X \leq x)$ for $X \sim BIN(n, p)$.

Use pbinom to determine the following probabilities assuming that $X \sim BIN(10, 0.25)$:

The probability of making 6 or less correct card guesses out of the 10 trials?
The probability of making 3 or less correct card guesses out of the 10 trials?
The probability of making more than 3 correct card guesses out of the 10 trials?
The probability of making more than 8 correct card guesses out of the 10 trials?
The probability of making 7 or 8 correct card guesses out of the 10 trials? (Think carefully about this.)
$P(X \leq 6)$
$P(X < 6)$ (think carefully about this)
$P(X > 8)$
$P(X \geq 8)$
$P(5 \leq X < 8)$

🎧 Online students

💬 For each sub-question selected by the facilitator, enter your answer next to the question in the shared Google Doc.

2.5.1

💻 Suppose that one student claims they carried out the experiment twice, and guessed 7 out of 10 cards correctly in the first experiment, and then 8 out of 10 cards correctly, in the second experiment. Do you believe them?

🎧 Online students

💬 Enter your answer in the chat.

2.6 `rbinom` (if you have time)

💻 Suppose that there are 20 students in your class (including yourself), and every student guesses the suit of playing card 10 times. We can simulate the number of correct guesses of each of the 20 students at once, using the rbinom function.

Check the comments in the code below to help you get started, refer back to 2.4 for argument details, and try to simulate these results. Store your results in an object called y. Display the results stored in y in the R console, and explain what the 20 numbers represent.

# rbinom takes 3 arguments, namely, n, size, and prob. 
# Run the line below for more details
?rbinom

2.6.1

💻 Use the summary function to assess your newly generated data stored in y. Considering this simulated group of 20 students, what was the average number of correct guesses? What was the maximum number of correct guesses? Are you surprised by these results?

3 Overlaying a Normal curve on a Histogram

💻 For this question, we will assess a data set on secondary school Portuguese students’ performances in Mathematics classes (UCI Machine Learning Repository 2014).

This data set is stored in the student-mat.csv file in the Week 4 Module on LMS. Download this file now, and save it in a relevant location on your PC.

3.1

💻 Open up RStudio and create a new script file. As you work through the questions, you can copy and paste the code provided below into your script file.

3.2

💻 To begin, set your working directory to the folder in which you have saved the student-mat.csv file, so that R can access our data set.

Hint: Need a refresher on how to import the data into R? Watch the video “How do I import data into R?” which is available on LMS, just below the student-mat.csv file.

3.3

💻 Now load in your data to RStudio, using the code below.

Note: By default, the csv ‘separator’ is normally a comma, but in this file, it’s a semi-colon. Therefore, when importing the file, you will need to include the additional sep = ";" as shown below:

data <- read.csv("student-mat.csv", sep = ";")

3.4

💻 Use View(data) to look at the data set.

3.5

💻 As you may have found, there are 33 different variables, for 395 students! Some of these variables are categorical, while others are numeric.

Let’s focus on a subset, namely G1, G2 and G3. These are the first period grade, the second period grade, and the final grade for the year respectively, for students in Maths (note that the school year in Portugal is split into three terms or periods). We will also focus on the absences variable, which denotes the number of absences from class a student had over the year.

To create our subset, run the following code:

results <- data[, c("absences", "G1", "G2", "G3")]

Note here we are only selecting the columns of data in which we are interested.

3.6

💻 For the following questions, we are only interested in non-zero G1, G2 and G3 results, so let’s run the following code to remove any zeros from these results.

results$G1[results$G1 == 0] <- NA # Set any 0s in G1 to NA
results$G2[results$G2 == 0] <- NA # Set any 0s in G2 to NA
results$G3[results$G3 == 0] <- NA # Set any 0s in G3 to NA
results <- na.omit(results) # Remove 0s (now NAs) from the data

3.7

💻 Recall from last week’s computer lab that we can create a histogram of data using the hist command. Create a histogram of the G1 results now, and add suitable labels to the x axis and title.

Hint: You can check the code chunk below if you need to refresh your memory.

hist(x = results$G1, main = "First Period Results", xlab = "Score")

3.8

💻 We can overlay our histogram with a Normal distribution curve.

Note: For this overlay to be accurate, we need to add the argument Freq = FALSE into our hist code. This will ensure the histogram is plotted with densities shown on the vertical axis.

The code for overlaying a Normal distribution curve on our G1 histogram is shown below. Read through this code carefully and make sure you understand each component, then run the code.

hist(x = results$G1, freq = FALSE, 
     main = "First Period Results", 
     xlab = "Score", ylim = c(0, 0.14))
# Note that we have added the `ylim` command to add some space between the histogram and the title.

curve(dnorm(x, mean = mean(results$G1), sd = sd(results$G1)), col="green", yaxt="n", lwd=2, add=TRUE)

3.9

💻 Using the code in 3.7 and 3.8 as a guide, plot histograms for each of the 4 selected variables (G1, G2, G3 and absences), and overlay suitable Normal distribution curves on these histograms.

Before you plot your histograms, run the R code below, to ensure that all plots appear in the 1 graph.

windows() # or quartz() for mac
par(mfrow = c(2,2), cex = 0.8, mex = 0.8) # This sets the plotting parameters to plot 4 plots in the 1 graph (2 rows and 2 columns). cex and mex change the text size and margins respectively.

3.10

💻 Now that you have overlaid normal curves to your histograms, what conclusions can you make about the data? Do you think that any of the histograms you just created looks normally distributed? List some details about each histogram to support your conclusions.

🎧 Online students

💬 Volunteer to share your screen and explain your answers to this question.

Well done, that’s everything for today. There were quite a few questions, so if you didn’t manage to complete everything during the lab, don’t worry - just make sure to finish off any remaining questions before the next core lab.

References

UCI Machine Learning Repository. 2014. “Student Performance Data Set [.csv File].” 2014. https://archive.ics.uci.edu/ml/datasets/student+performance.

These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.

STM1001: Computer Lab 4

Topic 3: Probability and Distributions

1 Using the norm R functions

1.1 pnorm

1.1.1

1.1.2

1.1.3

1.2 qnorm

1.2.1

1.3 rnorm (if you have time)

1.3.1

1.4 dnorm (if you have time)

2 Using the binom R functions

2.1 Playing Cards Example

2.2 The Binomial Distribution

2.3 Overview of binom R functions

2.4 dbinom

2.4.1

2.5 pbinom

2.5.1

2.6 rbinom (if you have time)

2.6.1

3 Overlaying a Normal curve on a Histogram

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8

3.9

3.10

Well done, that’s everything for today. There were quite a few questions, so if you didn’t manage to complete everything during the lab, don’t worry - just make sure to finish off any remaining questions before the next core lab.

References

1 Using the `norm` R functions

1.1 `pnorm`

1.2 `qnorm`

1.3 `rnorm` (if you have time)

1.4 `dnorm` (if you have time)

2 Using the `binom` R functions

2.3 Overview of `binom` R functions

2.4 `dbinom`

2.5 `pbinom`

2.6 `rbinom` (if you have time)