In Topic 3 we introduced the concept of probability. In this computer lab, we will cover how to carry out probability calculations in R, focusing on important continuous and discrete random variables and probability distributions.
By the end of this lab, you should feel confident in using and distinguishing between R functions for computing the following Normal and Binomial distribution aspects:
In Lecture 4 (Part 2), we covered material that will be very helpful for today’s computer lab. If you have not already watched Lecture 4 (Part 2), you may wish to do so now.
After working through the questions in this computer lab, you will be ready to complete Quiz 4. If you have time during today’s lab, you may like to work on the quiz.
norm
R functions🏡 In this question, we will learn how to use the various norm
functions in R, which are related to the Normal Distribution (hence norm
).
Firstly, recall that we express the distribution of a normally distributed random variable \(X\) as \[X \sim N(\mu, \sigma^2).\] Notice that we use the \(\color{blue}{\text{variance}, \sigma^2}\), when specifying the \(\color{blue}{\text{distribution}}\).
\(\color{crimson}{\text{Be careful:}}\) The various norm
R functions require \(\color{crimson}{\text{$\sigma$, i.e. the standard deviation (sd)}}\), to be specified as the spread information, not the \(\color{blue}{\text{variance}}\).
As we can see below, there are 4 versions of the norm
function. The additional letter preceding norm
specifies the aspect of the Normal distribution in which we are interested.
Hence we have:
\(\color{blue}{\text{Note:}}\) If we do not specify mean
and sd
, R will assume we want to calculate values from the Standard Normal Distribution. That is, R will assume we have mean = 0
and sd = 1
.
To help us further understand how these functions work, we will now consider some examples. A walkthrough video for sections 1.1 - 1.1.2 is included below. Give these questions a go, and refer to the video if you find yourself stuck at any stage.
pnorm
🏡 Suppose we would like to know \(P(X \leq 1)\), the probability that the random variable \(X\) has an observed value less than or equal to 1 (where \(X \sim N(0,1)\)).
To calculate this, we can use the pnorm
function as follows:
pnorm(1, mean = 0, sd = 1)
This gives us a value of 0.84. This probability is equivalent to computing the area under the standard Normal curve to the left of \(x = 1\). This area is shaded in blue in the plot below:
🏡 To help solidify our understanding of the pnorm
function, let’s consider the probability that the random variable \(X\) has an observed value less than or equal to a different value, i.e. 1.5. \(P(X \leq 1.5)\) corresponds to the shaded blue area in the plot below:
If we compute this probability in R, we obtain the value 0.9332. See if you can use the pnorm
function in R to verify this result.
🏡 If we compare the probabilities \(P(X \leq 1)\) and \(P(X \leq 1.5)\), and check the plots shown in 1.1 and 1.1.1, we can see that by increasing our observed \(x\) value from 1 to 1.5, we have covered an additional area of roughly 0.092 under the curve. This additional area is shaded red in the plot below, and represents the probability that X is between 1 and 1.5. That is, that \(P(1 \leq X \leq 1.5)\):
To compute this value in R, we can use pnorm
in the following manner:
pnorm(1.5, mean = 0, sd = 1) - pnorm(1, mean = 0, sd = 1)
Note that here, the first use of pnorm
is equivalent to calculating the blue shaded area in 1.1.1, while the second use of pnorm
is equivalent to calculating the blue shaded area in 1.1. By subtracting the second area from the first (i.e. pnorm(1.5, mean = 0, sd = 1) - pnorm(1, mean = 0, sd = 1)
), we end up with just the red shaded area shown in 1.1.2, i.e. \(P(1 \leq X \leq 1.5)\), as desired.
💻 As you can see, not only can we use pnorm
to compute the probability that the random variable \(X\) has an observed value less than or equal to some value (e.g. \(x=1\)), but we can also use pnorm
to compute the probability that the random variable \(X\) takes an observed value between two arbitrary values (e.g. between \(x=1\) and \(x=1.5\)).
To conclude this section, use the pnorm
function to determine the following probabilities. For each of these calculations, try and sketch (by hand, not using R!) a picture of the standard Normal distribution, and then fill in the relevant area under the curve.
A walkthrough video for questions a-e is included below. Give these questions a go, and refer to the video if you find yourself stuck at any stage.
\(^\dagger\)Hint: For g
and h
, think about what the total area under a valid probability density curve must equal.
qnorm
💻 Note that in 1.1, our distribution function value was \(P(X \leq 1) = 0.8413\).
Suppose that we knew \(X \sim N(0,1)\), and were given the probability value 0.8413447, but did not know the value of \(x\) to which it corresponded (i.e. the quantile value). In this instance, we could calculate this value of \(x\) using the R function qnorm
as follows:
qnorm(0.8413447, mean = 0, sd = 1)
## [1] 0.9999998
As you can see, this gives us our quantile value of \(x = 1\) (for a reasonable number of decimal places of accuracy). To help visualise this, take a look at the plot below:
Note here that the area under the curve to the left of the quantile \(x=1\) (shown with the dashed red line) is equal to \(P(X \leq 1) = 0.8413447\) (shown in blue).
💻 To conclude this section, use the qnorm
function to determine the following quantiles. Remember, for each of these calculations, it might be helpful to sketch (by hand, not using R!) a picture of the standard Normal distribution, and then fill in the relevant area under the curve (roughly) to determine the position of the quantile value.
Make sure to round your answers to 2 decimal places.
rnorm
(if you have time)💻 Next, we use the rnorm
function to randomly generate 10000 values from a Normal distribution with mean = 0
and sd = 1
. Note that since we are using these parameter values, the values will actually be generated from the standard Normal distribution.
Check the code below, and make sure you understand the process.
set.seed(2)
y <- rnorm(10000, mean = 0, sd = 1)
Note that we run the command set.seed(2)
here to ensure that every time we generate this document (and in doing so generate the random sample y
), the same random number generation process is applied, so our results will not change - if we do not specify a seed, every time we generate y
our results will be different.
💻 The following code plots a histogram of our randomly generated data, and overlays a standard normal curve. Run the below code to create the plot. We will have more practice in Question 3.9.
hist(y, xlab = "x value", main = "Example Plot", col = "skyblue", freq = FALSE)
curve(dnorm(x, mean = 0, sd = 1),
col = "orange", yaxt = "n", lwd = 3, add = TRUE)
Note that the histogram won’t look perfectly bell-shaped, since we are considering a random sample of values.
dnorm
(if you have time)💻 Suppose we want to obtain the density of this curve, at the value \(x=1\). To do this, we specify our value of 1 and use the dnorm
function as follows:
dnorm(1, mean = 0, sd = 1)
This gives us a value of 0.242. Let’s visualise this point on our density curve.
Note that since this is a Normal distribution (and therefore it is symmetrical), the horizontal red line will also intersect the curve at the point (\(x = -1, y = 0.2420\)). See if you can adapt the code provided above to verify this for yourself in R.
binom
R functions🏡 Hopefully, working through 1 has made the differences and usages of the different norm
functions much clearer.
As you may have guessed, these density, distribution, quantile and random number generating functions are not restricted to the Normal distribution.
Let us now consider applying what we have learnt in 1 to the Binomial Distribution, which is a discrete distribution. In discrete distributions, we have a finite number of possible outcomes, and each of these outcomes has a probability of occurring, with the sum of these probabilities equal to 1.
🏡 Recall the playing cards example introduced in the Topic 3 Lecture. In this example, we guessed the suit of a playing card (hearts, diamonds, spades or clubs) 10 times, with replacement, from a standard deck of playing cards.
Since there are only 4 suits, and each suit has the same number of cards, the probability of a correct guess was 25%, or 0.25.
Your number of correct guesses, which we will refer to as \(X\), had a range from 0 up to 10. However, given the probability of a correct guess was 25%, it is highly unlikely that you will have made a large number of correct guesses (if you did, well done!).
We can actually quantify the probability associated with making a certain number of correct guesses, using the Binomial distribution.
Let’s take a look at how the Binomial distribution works.
🏡 Suppose we have \(n\) “trials”, each with an outcome of either “success” or “failure”. Further suppose that for each trial, the probability of “success” is equal to \(p\), and that \(X\) is the number of “successes” from the \(n\) trials. We can model \(X\) using the Binomial Distribution. In mathematical notation, we define the distribution as \[ X \sim BIN(n, p),\] where:
If, for example, we are interested in the probability of some number \(x\) successes (out of the \(n\) trials), we can write this probability down as \(P(X = x)\).
Given in our playing cards example we have \(n = 10\) trials with a probability of “success” of \(p = 0.25\) for each trial, we define the distribution in this example as: \[X \sim BIN(10, 0.25).\]
binom
R functions🏡 The below table provides an overview of the binom
functions we will be using in this question:
Notes:
pbinom
function is called q
. However in the above table, we have referred to it as x
for ease of notation.qbinom
and rbinom
functions are also available in R. If you would like to learn more about these functions, you can find more information in the help file, which can be opened by running ?qbinom
.dbinom
💻 Let’s try computing \(P(X=x)\) in R, for a few different values of \(x\). To do so, we will be using the dbinom
function.
Recalling that in our playing cards example we have \(n = 10\) trials, each with a probability of success of \(p = 0.25\), we can compute \(P(X=1)\) by running following code:
dbinom(1, 10, 0.25)
Make sure you understand the arguments here before proceeding. Using the above code, if you guessed the suit 10 times, what is the probability of guessing correctly exactly once?
A walkthrough video for this question is included below. Please refer to the video if you find yourself stuck at any stage with the dbinom
coding process. If you feel comfortable, proceed to section 2.4.1.
💻 Using the details in 2.4 as a guide, compute the following probabilities for \(X \sim BIN(10, 0.25)\):
What do you notice about these results?
pbinom
💻 We can use the pbinom
function to compute \(P(X \leq x)\) for \(X \sim BIN(n, p)\).
Use pbinom
to determine the following probabilities assuming that \(X \sim BIN(10, 0.25)\):
💻 Suppose that one student claims they carried out the experiment twice, and guessed 7 out of 10 cards correctly in the first experiment, and then 8 out of 10 cards correctly, in the second experiment. Do you believe them?
rbinom
(if you have time)💻 Suppose that there are 20 students in your class (including yourself), and every student guesses the suit of playing card 10 times. We can simulate the number of correct guesses of each of the 20 students at once, using the rbinom
function.
Check the comments in the code below to help you get started, refer back to 2.4 for argument details, and try to simulate these results. Store your results in an object called y
. Display the results stored in y in the R console, and explain what the 20 numbers represent.
# rbinom takes 3 arguments, namely, n, size, and prob.
# Run the line below for more details
?rbinom
💻 Use the summary
function to assess your newly generated data stored in y
. Considering this simulated group of 20 students, what was the average number of correct guesses? What was the maximum number of correct guesses? Are you surprised by these results?
💻 For this question, we will assess a data set on secondary school Portuguese students’ performances in Mathematics classes (UCI Machine Learning Repository 2014).
This data set is stored in the student-mat.csv
file in the Week 4 Module on LMS. Download this file now, and save it in a relevant location on your PC.
💻 Open up RStudio and create a new script file. As you work through the questions, you can copy and paste the code provided below into your script file.
💻 To begin, set your working directory to the folder in which you have saved the student-mat.csv
file, so that R can access our data set.
Hint: Need a refresher on how to import the data into R? Watch the video “How do I import data into R?” which is available on LMS, just below the student-mat.csv
file.
💻 Now load in your data to RStudio, using the code below.
Note: By default, the csv ‘separator’ is normally a comma, but in this file, it’s a semi-colon. Therefore, when importing the file, you will need to include the additional sep = ";"
as shown below:
data <- read.csv("student-mat.csv", sep = ";")
💻 Use View(data)
to look at the data set.
💻 As you may have found, there are 33 different variables, for 395 students! Some of these variables are categorical, while others are numeric.
Let’s focus on a subset, namely G1
, G2
and G3
. These are the first period grade, the second period grade, and the final grade for the year respectively, for students in Maths (note that the school year in Portugal is split into three terms or periods). We will also focus on the absences
variable, which denotes the number of absences from class a student had over the year.
To create our subset, run the following code:
results <- data[, c("absences", "G1", "G2", "G3")]
Note here we are only selecting the columns of data in which we are interested.
💻 For the following questions, we are only interested in non-zero G1
, G2
and G3
results, so let’s run the following code to remove any zeros from these results.
results$G1[results$G1 == 0] <- NA # Set any 0s in G1 to NA
results$G2[results$G2 == 0] <- NA # Set any 0s in G2 to NA
results$G3[results$G3 == 0] <- NA # Set any 0s in G3 to NA
results <- na.omit(results) # Remove 0s (now NAs) from the data
💻 Recall from last week’s computer lab that we can create a histogram of data using the hist
command. Create a histogram of the G1
results now, and add suitable labels to the x axis and title.
Hint: You can check the code chunk below if you need to refresh your memory.
hist(x = results$G1, main = "First Period Results", xlab = "Score")
💻 We can overlay our histogram with a Normal distribution curve.
Note: For this overlay to be accurate, we need to add the argument Freq = FALSE
into our hist
code. This will ensure the histogram is plotted with densities shown on the vertical axis.
The code for overlaying a Normal distribution curve on our G1
histogram is shown below. Read through this code carefully and make sure you understand each component, then run the code.
hist(x = results$G1, freq = FALSE,
main = "First Period Results",
xlab = "Score", ylim = c(0, 0.14))
# Note that we have added the `ylim` command to add some space between the histogram and the title.
curve(dnorm(x, mean = mean(results$G1), sd = sd(results$G1)), col="green", yaxt="n", lwd=2, add=TRUE)
💻 Using the code in 3.7 and 3.8 as a guide, plot histograms for each of the 4 selected variables (G1
, G2
, G3
and absences
), and overlay suitable Normal distribution curves on these histograms.
Before you plot your histograms, run the R code below, to ensure that all plots appear in the 1 graph.
windows() # or quartz() for mac
par(mfrow = c(2,2), cex = 0.8, mex = 0.8) # This sets the plotting parameters to plot 4 plots in the 1 graph (2 rows and 2 columns). cex and mex change the text size and margins respectively.
💻 Now that you have overlaid normal curves to your histograms, what conclusions can you make about the data? Do you think that any of the histograms you just created looks normally distributed? List some details about each histogram to support your conclusions.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.