Introduction

In this homework assignment, you will explore basic probability concepts using the built-in iris dataset in R. The dataset contains 150 observations of iris flowers, categorized into three species: Setosa, Versicolor, and Virginica. Variables include the flower’s sepal and petal lengths and widths, which will help us calculate different probabilities and apply various rules of probability.

Let’s begin by loading the dataset and reviewing its structure.

Introduction to the iris Dataset

The iris dataset includes the following variables:

  • Sepal.Length: Length of the sepal in centimeters.
  • Sepal.Width: Width of the sepal in centimeters.
  • Petal.Length: Length of the petal in centimeters.
  • Petal.Width: Width of the petal in centimeters.
  • Species: The species of the iris flower (Setosa, Versicolor, Virginica).
# Load the iris dataset
data(iris)

# View the first 6 rows of the dataset
head(iris)
# An image of different species of iris flowers. Note the sepals and petals -- these are what we are measuring in our data!
knitr::include_graphics("iris_flowers.png")


PART 1: Recall Descriptive Statistics: Probability will Bridge the Gap Between Descriptive and Inferential Statistics

Descriptive statistics summarize data, while inferential statistics allow us to make predictions or inferences about a population based on sample data. First, let’s calculate some basic descriptive statistics for Sepal.Length.

# PLAY ME: Use the summary() function on `iris$Sepal.Length` to calculate basic descriptive statistics for Sepal.Length
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
# NOW YOU TRY: Use the boxplot() function on `iris$Sepal.Length` to create a boxplot of this variable. Note: You may want to include the command `horizontal = TRUE` inside the function to make your boxplot horizontal, but this is not required.
boxplot(iris$Sepal.Length, horizontal = TRUE)

QUESTION 1: Based on the summary statistics, what are the mean and median for Sepal.Length? What does this tell you about the distribution and skewness of this variable? Does your boxplot agree with this conclusion?

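ANSWER 1: The mean is 5.84 and the median is 5.8. Because the mean is slightly larger than the median, the distribution of Sepal.Length is slightly right-skewed. The boxplot agrees: the upper whisker (from 6.4 up to 7.9) is longer than the lower whisker (from 5.1 down to 4.3).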

Now that we’ve summarized some of the data, let’s begin exploring probabilities based on this dataset. This will help us infer about the population of iris flowers.


PART 2: Sample Space and Events

The sample space consists of all possible outcomes. For example, the sample space for Species in the iris dataset includes setosa, versicolor, and virginica. These three species make up the sample space for Species because an iris flower in this dataset must be one of these three species! Let’s consider the following events:

  • Event A: The flower is of species setosa.
  • Event B: The flower has a Sepal.Length greater than 6 cm.

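In R, a comparison such as iris$Species == "setosa" returns TRUE or FALSE for each flower. Because R counts TRUE as 1 and FALSE as 0, taking the mean() of that comparison gives the proportion of flowers for which the event occurs, which is the probability of the event in this dataset.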
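As a small illustration of why mean() returns a probability, the sketch below uses a made-up vector (`lengths_cm` is hypothetical and is not taken from iris). It shows that the mean of a TRUE/FALSE vector equals the number of TRUE values divided by the total number of values.

```r
# Hypothetical sepal lengths, for illustration only (not from iris)
lengths_cm <- c(5.1, 6.3, 5.8, 7.1, 6.5)

# The comparison yields one TRUE/FALSE value per observation
is_long <- lengths_cm > 6

# TRUE counts as 1 and FALSE as 0, so both lines give the same proportion
prop_by_mean  <- mean(is_long)
prop_by_count <- sum(is_long) / length(is_long)

prop_by_mean
prop_by_count
```

Three of the five made-up values exceed 6, so both expressions evaluate to 3/5.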

# PLAY ME: Probability of Event A (Species is Setosa)
mean(iris$Species == "setosa")
## [1] 0.3333333

QUESTION 2: What is the probability of Event A, i.e., the flower being of species Setosa?

ANSWER 2: 0.33

# PLAY ME: Probability of Event B (Sepal.Length > 6 cm)
mean(iris$Sepal.Length > 6)
## [1] 0.4066667

QUESTION 3: What is the probability of Event B, i.e., the flower having a Sepal.Length greater than 6 cm?

ANSWER 3: 0.41

# NOW YOU TRY: Calculate the probability that an iris flower is of the species "virginica"
mean(iris$Species == "virginica")
## [1] 0.3333333

You can also find the probability of a joint event in R by combining two conditions with the & operator (“and”) and taking the mean of the result, which is TRUE only for flowers where both events occur. For example, we can calculate the joint probability of Event A (The flower is of species setosa) & Event B (The flower has a Sepal.Length greater than 6 cm) using the following code:

# PLAY ME: Calculate the joint probability of Event A (The flower is of species setosa) and Event B (The flower has a `Sepal.Length` greater than 6 cm)
mean(iris$Species == "setosa" & iris$Sepal.Length > 6)
## [1] 0

QUESTION 4: What is the joint probability of an iris flower being species “setosa” and having a sepal length greater than 6 cm?

ANSWER 4: 0. No setosa flower in the dataset has a Sepal.Length greater than 6 cm.

# NOW YOU TRY: Calculate the joint probability that an iris flower is of the species "virginica" and has `Sepal.Length` > 6 cm
mean(iris$Species == "virginica" & iris$Sepal.Length > 6)
## [1] 0.2733333

QUESTION 5: What is the joint probability that an iris flower is of the species “virginica” and has sepal length greater than 6 cm? Based on your answer in QUESTION 4, what can you infer about the size of “setosa” iris flowers compared to the size of “virginica” iris flowers?

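ANSWER 5: About 0.27 (41 of the 150 flowers). Since the corresponding joint probability for setosa in QUESTION 4 is 0, virginica flowers tend to have longer sepals than setosa flowers.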

QUESTION 6: If you came across a wild iris flower with a sepal length longer than 6 cm, would you infer that it was most likely of the species “setosa” or “virginica”?

ANSWER 6: Virginica. In this dataset, the joint probability of being virginica and having a Sepal.Length > 6 cm is about 0.27, while the same joint probability for setosa is 0.

Aren’t statistics cool?


PART 3: Mutually Exclusive and Collectively Exhaustive Events

Mutually exclusive events cannot occur at the same time. For example, a flower cannot be both Setosa and Versicolor. Collectively exhaustive events cover all possible outcomes, ensuring that at least one of the events must happen.
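One way to check both properties for the three species events is sketched below. The helper vectors (`is_setosa`, `is_versicolor`, `is_virginica`, `n_events_true`) are names introduced here for illustration. The idea is to count, for each flower, how many of the three events are TRUE.

```r
data(iris)

# One TRUE/FALSE indicator vector per species event
is_setosa     <- iris$Species == "setosa"
is_versicolor <- iris$Species == "versicolor"
is_virginica  <- iris$Species == "virginica"

# Number of species events that are TRUE for each flower
n_events_true <- is_setosa + is_versicolor + is_virginica

# Mutually exclusive: no flower satisfies more than one event
all(n_events_true <= 1)

# Collectively exhaustive: every flower satisfies at least one event
all(n_events_true >= 1)
```

If both checks return TRUE, every flower falls into exactly one of the three events.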

QUESTION 7: Are the events “Species is Setosa” and “Sepal length is greater than 6 cm” mutually exclusive? Why or why not?

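ANSWER 7: In this dataset, yes. The joint probability calculated in QUESTION 4 is 0, so no flower is both setosa and longer than 6 cm in Sepal.Length (the longest setosa sepal is 5.8 cm). Note that this is an observed pattern in the data rather than a logical impossibility, unlike a flower being two species at once.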

QUESTION 8: Are the three species (Setosa, Versicolor, and Virginica) collectively exhaustive? Are they mutually exclusive? Explain your reasoning.

ANSWER 8: These species are collectively exhaustive, as every flower in the dataset must be one of the three species. They are also mutually exclusive, because a flower can only be one species at a time.


PART 4: Simple (Marginal) and Joint Probability

Step 1: Creating Binary Variables

We will now create two binary variables based on continuous variables:

  • Wide_Sepal: 1 if Sepal.Width is greater than 3 cm, 0 otherwise.
  • Long_Petal: 1 if Petal.Length is greater than 4 cm, 0 otherwise.
# PLAY ME: Create the binary variables
iris$Wide_Sepal <- ifelse(iris$Sepal.Width > 3, 1, 0)
iris$Long_Petal <- ifelse(iris$Petal.Length > 4, 1, 0)

# View the first 6 rows of the modified dataset. Note the two new variables on the right of the table: Wide_Sepal and Long_Petal.
head(iris)

Step 2: Creating a Contingency Table

A contingency table helps us observe the frequency of joint occurrences of two or more categorical variables. In this case, we are looking at Wide_Sepal (whether the flower has a wide sepal) and Long_Petal (whether the flower has a long petal).

The marginal totals (the sums of the rows and columns) represent the total frequencies for each individual event. These totals help us calculate marginal probabilities, such as the probability of having a wide sepal regardless of petal length.

Here’s how you can generate a contingency table with marginal totals:

# PLAY ME: Create a contingency table
addmargins(table(Wide_Sepal = iris$Wide_Sepal, Long_Petal = iris$Long_Petal))
##           Long_Petal
## Wide_Sepal   0   1 Sum
##        0    24  59  83
##        1    42  25  67
##        Sum  66  84 150

Note: A value of 1 for Wide_Sepal indicates a flower had a wide sepal, 0 indicates it did not have a wide sepal. Likewise for Long_Petal.
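As a rough sketch of how the counts in a contingency table turn into probabilities, the code below divides each cell by the grand total with prop.table() and then sums across rows and columns. The vectors `wide_sepal` and `long_petal` are rebuilt locally (an assumption made so the sketch runs on its own, without the columns added in Step 1).

```r
data(iris)

# Rebuild the two binary indicators locally so this sketch is self-contained
wide_sepal <- ifelse(iris$Sepal.Width > 3, 1, 0)
long_petal <- ifelse(iris$Petal.Length > 4, 1, 0)

counts <- table(Wide_Sepal = wide_sepal, Long_Petal = long_petal)

# Dividing every cell by the grand total gives the joint probabilities
joint_probs <- prop.table(counts)

# Row sums give the marginal probabilities for Wide_Sepal;
# column sums give the marginal probabilities for Long_Petal
marginal_wide <- rowSums(joint_probs)
marginal_long <- colSums(joint_probs)

joint_probs
marginal_wide
marginal_long
```

Each marginal probability is the matching "Sum" entry of the contingency table divided by 150.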

QUESTION 9: Based on the contingency table, what is the marginal probability of having a wide sepal? Note: You may need to calculate this by hand using the table.

ANSWER 9: 67/150 ≈ 0.45

Similar to before, you can check your work (that is, check the marginal probability of a flower having a wide sepal, regardless of petal length) by calculating the mean of a flower having a wide sepal. Was your hand-calculation correct?

# PLAY ME: Marginal probability of having a wide sepal
mean(iris$Wide_Sepal == 1)
## [1] 0.4466667

As we’ve learned in class, you can also calculate joint probabilities from a contingency table. Use your contingency table above to calculate the probability that an iris flower has a wide sepal AND a long petal.

QUESTION 10: Based on your contingency table, what is the probability that a flower has a wide sepal AND long petal? Note: You will need to calculate this answer by hand.

ANSWER 10: 25/150 ≈ 0.17

As we’ve shown previously, R can also calculate joint probabilities. Follow the instructions in the chunk below.

# NOW YOU TRY: Similar to QUESTION 4-6, use the mean() function to calculate the joint probability of a flower having both wide sepal (`Wide_Sepal == 1`) and long petal (`Long_Petal == 1`).
mean(iris$Wide_Sepal == 1 & iris$Long_Petal == 1)
## [1] 0.1666667

QUESTION 11: Based on your code above, are iris flowers more likely to have a wide sepal and long petal OR are they more likely to have a wide sepal and short petal? Do the probabilities in your R code seem to agree with the contingency table? Explain.

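ANSWER 11: It is more likely for a flower to have a wide sepal and a short petal. From the contingency table, P(wide sepal and short petal) = 42/150 = 0.28, while P(wide sepal and long petal) = 25/150 ≈ 0.17. The R result of 0.1666667 agrees with the table value of 25/150.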


PART 5: The Addition Rule

The addition rule is used to find the probability of either one of two events occurring. It can be applied when we want to calculate the probability of “either A or B” happening. The general formula for the addition rule is:

P(A or B)=P(A)+P(B)−P(A and B)

This formula ensures that we don’t double-count the overlap between events A and B, which is the probability of both events happening at the same time. When two events are mutually exclusive (meaning they cannot occur simultaneously), P(A and B) will be zero, and the formula simplifies to:

P(A or B)=P(A)+P(B)

Example: We will calculate the probability of a flower having either a wide sepal or a long petal. In R, we can do this with the | operator, which means “or”: for each flower, A | B is TRUE when at least one of the two events occurs. Taking the mean() of that result gives P(A or B) directly, with the overlap counted only once:

# PLAY ME: Calculate the probability of a flower having a wide sepal or a long petal using the | operator
mean(iris$Wide_Sepal == 1 | iris$Long_Petal == 1)
## [1] 0.84
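To connect this result back to the formula, the sketch below computes P(A or B) twice: once from the three pieces of the addition rule and once directly with the | operator. The vectors `wide` and `long` are local helper names introduced here (an assumption, so the sketch does not depend on the columns created in PART 4).

```r
data(iris)

wide <- iris$Sepal.Width > 3     # Event A: wide sepal
long <- iris$Petal.Length > 4    # Event B: long petal

p_a       <- mean(wide)          # P(A)
p_b       <- mean(long)          # P(B)
p_a_and_b <- mean(wide & long)   # P(A and B), the overlap

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_by_rule <- p_a + p_b - p_a_and_b

# Direct calculation with the element-wise OR operator
p_direct <- mean(wide | long)

all.equal(p_by_rule, p_direct)
```

In terms of counts, this is 67/150 + 84/150 − 25/150 = 126/150 = 0.84.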

QUESTION 12: What is the probability of either having a wide sepal or a long petal?

ANSWER 12: 0.84

QUESTION 13: If the events were mutually exclusive (meaning no overlap between the two events), how could you adjust your calculation? Provide a brief explanation.

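ANSWER 13: If the events were mutually exclusive, P(A and B) would be 0, so there would be no overlap to subtract. The calculation would simplify to P(A or B) = P(A) + P(B).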

Let’s put your work in QUESTION 13 to the test. Two events that should be mutually exclusive are “the flower is of the setosa species” and “the flower is of the virginica species” (a flower can only be one species).

# NOW YOU TRY: Similar to the code for QUESTION 12, calculate the probability of iris$Species == "setosa" OR iris$Species == "virginica" using the | symbol inside of the mean() function.
mean(iris$Species == "setosa" | iris$Species == "virginica")
## [1] 0.6666667

Next, calculate the simple probability of a flower being of the “setosa” species using the mean() function. Then calculate the simple probability of a flower being of the “virginica” species using the mean() function. Use the addition rule for mutually exclusive events (shown above) to calculate the probability that a flower is of the “setosa” species OR the “virginica” species.

# NOW YOU TRY: Use the mean() function to calculate the simple probability of a flower being of the "setosa" species. Do the same for the simple probability of a flower being of the "virginica" species. Add these two probabilities together.
mean(iris$Species == "setosa") + mean(iris$Species == "virginica")
## [1] 0.6666667

QUESTION 14: You should have calculated 0.6667 in both R chunks above. Does this confirm that these events are mutually exclusive? Why or why not?

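ANSWER 14: Yes. By the general addition rule, P(A or B) = P(A) + P(B) − P(A and B), so the two results can only match if P(A and B) = 0. No flower in the dataset is both setosa and virginica, which is what mutually exclusive means.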


PART 6: Conditional Probability and Independence

Conditional probability is the probability of one event happening given that another event has already occurred. It is denoted as:

P(A∣B)=P(A and B)/P(B)

Here, P(A∣B) is the probability of event A occurring given that event B has occurred, P(A and B) is the joint probability of both events, and P(B) is the probability of event B.

Let’s see this in action. We will calculate the probability that a flower has a long petal conditional on it having a wide sepal:

# PLAY ME: Conditional probability of Long_Petal given Wide_Sepal
mean(iris$Long_Petal == 1 & iris$Wide_Sepal == 1) / mean(iris$Wide_Sepal == 1)
## [1] 0.3731343
# NOW YOU TRY: Calculate the probability of a Long_Petal == 1 conditional on Species == "virginica". Hint: This is a similar process to the chunk directly above.
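One possible way to complete this chunk is sketched below, following the same pattern as the previous chunk. The vectors `long_petal` and `is_virginica` are local helper names introduced here (an assumption, so the sketch runs without the Long_Petal column created in PART 4).

```r
data(iris)

long_petal   <- iris$Petal.Length > 4
is_virginica <- iris$Species == "virginica"

# P(Long_Petal | virginica) = P(Long_Petal and virginica) / P(virginica)
p_long_given_virginica <- mean(long_petal & is_virginica) / mean(is_virginica)
p_long_given_virginica

# Equivalent shortcut: keep only the virginica flowers, then take the mean
mean(long_petal[is_virginica])
```

In the built-in iris data this should come out to 1, since every virginica flower has a Petal.Length above 4 cm (the smallest is 4.5 cm).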

QUESTION 17: Is it likely that a flower has a long petal if it is of the “virginica” species? Why or why not?

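ANSWER 17: Yes. Every virginica flower in the dataset has a Petal.Length greater than 4 cm (the smallest is 4.5 cm), so the conditional probability of a long petal given virginica is 1.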


PART 7: The Multiplication Rule

The Multiplication Rule is used to find the probability of two events occurring together (i.e., the joint probability of two events). If two events are independent, the joint probability can be calculated by multiplying their individual probabilities.

For example, if Event A is “the flower has a wide sepal” and Event B is “the flower has a long petal,” we can calculate the joint probability of both events happening if we assume these events are independent.

If two events are independent, the formula for the multiplication rule is:

P(A and B) = P(A) * P(B)

Example: Calculate the probability of both having a wide sepal and a long petal, assuming independence.

# PLAY ME: Using the multiplication rule to calculate the joint probability of wide sepal and long petal (assuming independence)
mean(iris$Wide_Sepal == 1) * mean(iris$Long_Petal == 1)
## [1] 0.2501333
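The multiplication rule above only gives the true joint probability when the events really are independent. As a hedged check of that assumption, the sketch below compares the product of the marginal probabilities with the joint proportion actually observed in the data. The names `wide`, `long`, `p_if_independent`, and `p_observed` are introduced here for illustration.

```r
data(iris)

wide <- iris$Sepal.Width > 3     # Event A: wide sepal
long <- iris$Petal.Length > 4    # Event B: long petal

p_if_independent <- mean(wide) * mean(long)   # product of the marginals
p_observed       <- mean(wide & long)         # observed joint proportion

c(if_independent = p_if_independent, observed = p_observed)

# Under independence, P(B | A) would equal P(B)
c(p_long = mean(long), p_long_given_wide = mean(long[wide]))
```

The observed joint proportion (25/150 ≈ 0.17) is noticeably smaller than the product of the marginals (≈ 0.25), which suggests that wide sepals and long petals are not independent in this sample.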

QUESTION 18: Based on the multiplication rule, what is the probability of an iris flower having both a wide sepal and a long petal, assuming independence?

ANSWER 18: 0.25

# NOW YOU TRY: Recalculate the joint probability of a flower having both a wide sepal and a short petal (Wide_Sepal == 1 and Long_Petal == 0), assuming independence.
mean(iris$Wide_Sepal == 1) * mean(iris$Long_Petal == 0)
## [1] 0.1965333

QUESTION 19: What is the joint probability of an iris flower having a wide sepal and short petal, assuming independence?

ANSWER 19: 0.20


PART 8: Bayes’ Theorem

Bayes’ Theorem allows us to reverse conditional probabilities. It helps us calculate the probability of an event happening based on prior knowledge of conditions that might be related to the event. One way to write Bayes’ Theorem is the following:

P(A|B) = (P(B|A)*P(A))/P(B)

Let’s use Bayes’ Theorem to calculate the probability of having a wide sepal given that the flower has a long petal.

# PLAY ME: Calculate the required probabilities for Bayes' Theorem
P_A <- mean(iris$Wide_Sepal == 1)                    # P(A): Probability of having a wide sepal
P_B <- mean(iris$Long_Petal == 1)                    # P(B): Probability of having a long petal
P_B_given_A <- mean(iris$Long_Petal == 1 & iris$Wide_Sepal == 1) / mean(iris$Wide_Sepal == 1)  # P(B|A): Probability of having a long petal given a wide sepal

# Applying Bayes' Theorem to calculate P(A|B)
P_A_given_B <- (P_B_given_A * P_A) / P_B
P_A_given_B
## [1] 0.297619

QUESTION 20: Based on the calculation using Bayes’ Theorem, what is the probability of an iris flower having a wide sepal given that it has a long petal?

ANSWER 20: 0.30

# NOW YOU TRY: Use Bayes' Theorem to calculate the probability that an iris flower has a wide sepal given it is of the "setosa" species.
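One possible way to complete this chunk is sketched below, mirroring the PLAY ME chunk in this part with Event A as “wide sepal” and Event B as “species is setosa”. The vectors `wide` and `is_setosa` are local helper names introduced here (an assumption, so the sketch runs without the Wide_Sepal column created in PART 4).

```r
data(iris)

wide      <- iris$Sepal.Width > 3        # Event A: wide sepal
is_setosa <- iris$Species == "setosa"    # Event B: species is setosa

P_A         <- mean(wide)                            # P(A)
P_B         <- mean(is_setosa)                       # P(B)
P_B_given_A <- mean(is_setosa & wide) / mean(wide)   # P(B|A)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
P_A_given_B <- (P_B_given_A * P_A) / P_B
P_A_given_B
```

In the built-in iris data, 42 of the 50 setosa flowers have a Sepal.Width above 3 cm, so the result should be 0.84.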

QUESTION 21: Based on the code above, what is the probability that an iris flower has a wide sepal given that it is of the species “setosa”?

ANSWER 21: 0.84 (42 of the 50 setosa flowers have a Sepal.Width greater than 3 cm)

# NOW YOU TRY: Now use the conditional probability formula to calculate the probability that an iris flower has a wide sepal given it is of the "setosa" species.
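One possible way to complete this chunk is sketched below. It applies the conditional probability formula P(A∣B) = P(A and B)/P(B) directly, with A as “wide sepal” and B as “species is setosa”. The vectors `wide` and `is_setosa` are local helper names introduced here (an assumption, so the sketch runs on its own).

```r
data(iris)

wide      <- iris$Sepal.Width > 3        # Event A: wide sepal
is_setosa <- iris$Species == "setosa"    # Event B: species is setosa

# P(A | B) = P(A and B) / P(B)
P_wide_given_setosa <- mean(wide & is_setosa) / mean(is_setosa)
P_wide_given_setosa
```

In terms of counts this is (42/150) / (50/150) = 42/50 = 0.84.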

QUESTION 22: Based on the code above, what is the probability that an iris flower has a wide sepal given that it is of the species “setosa”? How does this probability compare to the one you calculated in QUESTION 21 using Bayes’ theorem?

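ANSWER 22: 0.84. This matches the result from QUESTION 21, as expected: Bayes’ Theorem is a rearrangement of the conditional probability formula, so both approaches give the same probability.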

PART 9: Counting Rules

In probability, Counting Rules allow us to calculate the number of ways to select or arrange items. These rules are crucial when determining how many possible outcomes can occur in various scenarios. In this part of the assignment, we’ll focus on combinations since we’re interested in choosing flowers without worrying about the order.

Combinations

Combinations are used when the order does not matter. For example, if you want to choose a few flowers from the iris dataset, the order in which you pick them is irrelevant—you’re just selecting a group.

The formula for combinations is:

C(n,r) = n!/(r!(n-r)!)

Where:

  • n is the total number of items (in this case, flowers)
  • r is the number of items you want to choose
  • ! (factorial) means multiplying all positive integers from 1 to that number (for example, 5! = (5)(4)(3)(2)(1) = 120)

Example: How many ways can you choose 3 flowers from 150?

Imagine you’re interested in selecting 3 flowers from the 150 flowers in the iris dataset. The order of selection does not matter here – you’re just picking a set of 3 flowers.

We can calculate this using the choose() function in R, which calculates combinations.

# PLAY ME: Use the choose() function to calculate the number of ways to choose 3 flowers from 150
choose(150,3)
## [1] 551300
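To tie the choose() result back to the formula, the sketch below evaluates C(150, 3) three ways: with factorials, with the simplified fraction that remains after cancelling (n − r)!, and with choose(). The variable names are introduced here for illustration.

```r
n <- 150
r <- 3

# Direct use of the formula n! / (r! * (n - r)!)
by_formula <- factorial(n) / (factorial(r) * factorial(n - r))

# After cancelling (n - r)!, only r terms remain in the numerator
by_cancelling <- (150 * 149 * 148) / (3 * 2 * 1)

c(by_formula = by_formula, by_cancelling = by_cancelling, by_choose = choose(n, r))
```

All three should agree (up to floating-point rounding in the factorial version), which is why choose() can be treated as a shortcut for the formula.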

QUESTION 23: Based on the counting rule for combinations, how many different combinations of 3 flowers can be selected from the 150 flowers in the dataset?

ANSWER 23: 551,300

# NOW YOU TRY: Use the choose() function to calculate how many different ways you can choose 4 flowers from the 150 flowers in the dataset.
choose(150,4)
## [1] 20260275

QUESTION 24: Based on your calculation, how does the number of combinations change when choosing 4 flowers instead of 3 from the dataset?

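ANSWER 24: There are 20,260,275 combinations of 4 flowers, compared with 551,300 combinations of 3 flowers. Choosing one more flower increases the number of combinations by a factor of about 37 (20,260,275 / 551,300 = 36.75).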

Permutations

In some scenarios, order matters, and that’s when permutations come into play. For example, if we wanted to arrange the flowers in a particular order (say for display), the order of selection would be important.

The formula for permutations is:

P(n,r) = n!/(n-r)!

Where:

  • n is the total number of items
  • r is the number of items you want to choose
  • ! (factorial) means multiplying all positive integers from 1 to that number

Example: Suppose you wanted to arrange 3 flowers out of 150 in a specific order. Base R has no built-in function named perm(), so the chunk below defines a small custom perm() function that applies the formula above.

# PLAY ME: Use a custom function for permutations
perm <- function(n, r) { factorial(n) / factorial(n - r) }
perm(150, 3)
## [1] 3307800
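The factorial-based function works for this example, but factorial(n) grows beyond the largest number R can store as a double once n is above 170, so the ratio stops being usable for larger n. A hedged alternative is sketched below: `perm_product()` is a hypothetical helper (not part of base R) that multiplies only the r terms n, n − 1, ..., n − r + 1 that survive after cancelling (n − r)!.

```r
# Hypothetical helper (not part of base R):
# multiplies n * (n - 1) * ... * (n - r + 1), i.e. n! / (n - r)! after cancelling
perm_product <- function(n, r) {
  prod(n - seq_len(r) + 1)
}

perm_product(150, 3)   # same count as the factorial-based version
perm_product(200, 2)   # 200 * 199, without computing factorial(200)
```

Using seq_len(r) means that r = 0 gives an empty product, which prod() evaluates as 1, matching the convention that there is exactly one way to arrange zero items.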

QUESTION 25: Does order matter for permutations or combinations? Explain the difference.

ANSWER 25: Combinations: Order does not matter. We only care about the selection of items. Permutations: Order matters. We care about the arrangement of items.

QUESTION 26: Were there more permutations or combinations when we pick 3 flowers from 150 possible flowers? Why is it important to consider whether the order of items matters?

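ANSWER 26: There are more permutations (3,307,800) than combinations (551,300). Each set of 3 flowers can be arranged in 3! = 6 different orders, and permutations count every ordering separately, so there are 6 times as many permutations. Deciding whether order matters tells you which counting rule applies; using the wrong one miscounts the outcomes by that factor.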