Replace “Your Name” with your actual name.

Objective:

This lab assignment aims to reinforce your understanding of descriptive statistics, calculating probabilities, and identifying sample spaces. You will apply these concepts to practical problems using R.

Homework Exercises:

Exercise 1: Analyzing Descriptive Statistics

  • Task: Given a dataset of reaction times, calculate the mean, median, mode, variance, and standard deviation, and identify any outliers.

  • Dataset: Use the following reaction times (in milliseconds) for the analysis: c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290).

  • Instructions:

    1. Calculate the mean, median, and mode.

    2. Calculate the variance and standard deviation.

    3. Identify any outliers using the IQR method.

    4. Write the R code to perform these calculations and interpret the results.

# Sample data vector
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290)
# Calculate mean
mean(reaction_times)
## [1] 311.75
# Calculate median
median(reaction_times)
## [1] 302.5
# Calculate mode
get_mode <- function(x) {
  uniqv <- unique(x)
  uniqv[which.max(tabulate(match(x, uniqv)))]
}

get_mode(reaction_times)
## [1] 295
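
Note that get_mode returns only the first value that reaches the highest count. In this dataset 285, 290, and 295 each appear twice, so the data actually have three modes. A minimal sketch that reports every tied mode is shown here; get_all_modes is a hypothetical helper written for illustration, not a base R function:

```r
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

# Hypothetical helper: return every value that ties for the highest frequency
get_all_modes <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # keep all values with the top count
}

all_modes <- get_all_modes(reaction_times)
print(all_modes)
```

When several values tie like this, it is usually safer to report all of them than to rely on whichever one happens to come first in the vector.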
# Calculate variance
var(reaction_times)
## [1] 1113.882
# Calculate standard deviation
sd(reaction_times)
## [1] 33.37486

The variance is 1113.88. This value is in squared units (ms²) and is difficult to interpret as is. Taking its square root gives the standard deviation, 33.37 ms, which is in the same units as the data. This means that a typical reaction time lies about 33 ms below or above the mean (312 ms). Most reaction times fall between roughly 278 ms and 345 ms.
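
The claim that most reaction times lie within one standard deviation of the mean can be checked directly. The sketch below counts the observations inside that interval; the variable names (m, s, within_1sd) are chosen here for illustration:

```r
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

m <- mean(reaction_times)
s <- sd(reaction_times)

# TRUE for each observation inside [mean - sd, mean + sd]
within_1sd <- reaction_times >= m - s & reaction_times <= m + s

n_within <- sum(within_1sd)      # number of observations in the interval
prop_within <- mean(within_1sd)  # share of observations in the interval

print(c(lower = m - s, upper = m + s))
print(n_within)     # 14
print(prop_within)  # 0.7
```

Here 14 of the 20 observations (70%) fall in the interval, which is close to the roughly 68% expected for normally distributed data.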

# Adding an outlier
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)

# Calculate quartiles and other statistics
Q1 <- quantile(reaction_times_outlier, 0.25)
print(paste("Quartile 1:", Q1))
## [1] "Quartile 1: 290"
Q3 <- quantile(reaction_times_outlier, 0.75)
print(paste("Quartile 3:", Q3))
## [1] "Quartile 3: 340"
IQR_value <- IQR(reaction_times_outlier)
print(paste("Quartile 3 - Quartile 1:", IQR_value))
## [1] "Quartile 3 - Quartile 1: 50"
median_val <- median(reaction_times_outlier)
print(paste("Median:", median_val))
## [1] "Median: 305"

What These Formulas Do

The lower-bound and upper-bound formulas, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, establish the boundaries for identifying outliers in a dataset using what statisticians call the “1.5 × IQR rule.” Any data point that falls below the lower bound or above the upper bound is considered an outlier.

Breaking It Down

Starting with the Box: The box in a boxplot represents the middle 50% of your data (from Q1 to Q3).

Extending Beyond the Box: We want to determine how far beyond the box a value can go before we consider it unusual.

The Multiplier (1.5): The factor 1.5 is a statistical convention that sets how far the “whiskers” of the boxplot may reach.

The whiskers extend to the most extreme data points that lie within 1.5 times the IQR of each quartile. This creates a reasonable range for “normal” data; anything beyond it is plotted as an individual point.
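
To see where R actually draws the whiskers, the sketch below calls boxplot.stats(), the function boxplot() uses internally. It works with Tukey's hinges, which coincide with the quartiles from quantile() for this particular dataset; the variable name stats is chosen here for illustration:

```r
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                            295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)

stats <- boxplot.stats(reaction_times_outlier)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(stats$stats)  # 250 290 305 340 370

# Points beyond the whiskers, drawn individually on the plot
print(stats$out)    # 900
```

The whiskers stop at 250 and 370, the most extreme observations inside the 1.5 × IQR range, and do not stretch all the way out to the limits of that range.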

Why 1.5?

The choice of 1.5 as the multiplier is based on statistical theory and practical experience:

  1. Statistical Properties: Under a normal distribution, approximately 99.3% of the data falls within these bounds. This means only about 0.7% of values would be flagged as outliers in normally distributed data.

  2. Balance: The value 1.5 provides a good balance between:

    • Being too sensitive (flagging too many points as outliers)

    • Being too lenient (missing actual outliers)

  3. Historical Precedent: John Tukey, who introduced the boxplot, adopted 1.5 as a rule of thumb that worked well in practice across many different types of data.
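
The 99.3% figure can be reproduced for a standard normal distribution. The sketch below applies the 1.5 × IQR rule to the theoretical quartiles; the names lower_fence, upper_fence, and coverage are chosen here for illustration:

```r
# Theoretical quartiles of the standard normal distribution
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1

# Bounds from the 1.5 x IQR rule, in standard-deviation units
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Probability that a normal observation falls between the bounds
coverage <- pnorm(upper_fence) - pnorm(lower_fence)

print(round(c(lower_fence, upper_fence), 3))  # about -2.698 and 2.698
print(round(coverage, 4))                     # about 0.993
```

The bounds sit about 2.7 standard deviations from the mean, so roughly 0.7% of normally distributed values would be flagged.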

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

print(paste("Lower Bound:", lower_bound))
## [1] "Lower Bound: 215"
print(paste("Upper Bound:", upper_bound))
## [1] "Upper Bound: 415"
# Create a boxplot
boxplot(reaction_times_outlier, 
        main = "Reaction Times with Labeled Quartiles",
        ylab = "Time (ms)",
        ylim = c(min(reaction_times_outlier, lower_bound) - 50, max(reaction_times_outlier, upper_bound) + 50),
        outline = TRUE,
        col = "lightblue")

# Add horizontal dashed lines for Q1 and Q3 to connect these lines across the box
abline(h = Q1, col = "darkblue", lty = 2)
abline(h = Q3, col = "darkblue", lty = 2)

# Add annotations for quantiles
text(x = 1.3, y = Q1, labels = paste("Q1 =", Q1), pos = 4, col = "darkblue")
text(x = 1.2, y = median_val, labels = paste("Median =", median_val), pos = 4, col = "darkgreen")
text(x = 1.3, y = Q3, labels = paste("Q3 =", Q3), pos = 4, col = "darkblue")
text(x = 1.2, y = lower_bound, labels = paste("Lower bound =", round(lower_bound, 1)), pos = 4, col = "red")
text(x = 1.2, y = upper_bound, labels = paste("Upper bound =", round(upper_bound, 1)), pos = 4, col = "red")

# Add a line for the IQR within the box
segments(0.8, Q1, 0.8, Q3, col = "purple", lwd = 3)
text(x = 0.6, y = (Q1 + Q3) / 2, labels = paste("IQR =", IQR_value), pos = 2, col = "purple")

# Highlight outliers
outliers <- reaction_times_outlier[reaction_times_outlier > upper_bound | reaction_times_outlier < lower_bound]
if (length(outliers) > 0) {
  text(x = 1, y = outliers, labels = "Outlier", pos = 4, col = "red", cex = 0.9)
}

# Connect the annotation texts for Q1 and Q3 with the boxplot using line segments
# For Q1
segments(x0 = 1.05, y0 = Q1, x1 = 1, y1 = Q1, col = "darkblue", lwd = 2, lty = 2)
# For Q3
segments(x0 = 1.05, y0 = Q3, x1 = 1, y1 = Q3, col = "darkblue", lwd = 2, lty = 2)

reaction_times_outlier[reaction_times_outlier < lower_bound | reaction_times_outlier > upper_bound]
## [1] 900

Quartiles are useful for identifying outliers because they split the data into four equal-sized “chunks.” From Q1, Q3, and the IQR we can calculate the lower and upper bounds (215 ms and 415 ms here), which lets us identify values that are either extremely low or extremely high. In this example, the only outlier is 900 ms.
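
The outlier check above can be wrapped in a small reusable function. This is a minimal sketch; find_outliers_iqr and its k argument are hypothetical names introduced here, and it relies on the default quantile() method, as the code above does:

```r
# Hypothetical helper: return the values of x outside the 1.5 x IQR bounds
find_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  fence <- k * (q[2] - q[1])                      # k times the IQR
  x[x < q[1] - fence | x > q[2] + fence]
}

reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)
reaction_times_outlier <- c(reaction_times, 900)

clean_result <- find_outliers_iqr(reaction_times)            # no outliers expected
outlier_result <- find_outliers_iqr(reaction_times_outlier)  # flags 900

print(clean_result)
print(outlier_result)
```

The original 20 reaction times produce an empty result, while the vector with the added value flags only 900.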

Exercise 2: Calculating Probabilities with the Normal Distribution

  • Task: Assume reaction times in a cognitive task follow a normal distribution with a mean of 300 ms and a standard deviation of 50 ms. Calculate the probability that a randomly selected individual has a reaction time:

    1. Less than 250 ms.

    2. Between 250 ms and 350 ms.

    3. More than 400 ms.

  • Instructions:

    1. Use the pnorm function in R to calculate these probabilities.

    2. Write the R code for these calculations and explain what each probability means.

# Parameters (named mu and sigma so they do not mask the base functions mean() and sd())
mu <- 300
sigma <- 50
# Probability of a reaction time less than 250 ms
pnorm(250, mu, sigma)
## [1] 0.1586553
# Probability of a reaction time between 250 ms and 350 ms
pnorm(350, mu, sigma) - pnorm(250, mu, sigma)
## [1] 0.6826895
# Probability of a reaction time more than 400 ms
1 - pnorm(400, mu, sigma)
## [1] 0.02275013

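The probability of a reaction time less than 250 ms is about 16%, so roughly 16 out of every 100 individuals would respond that quickly. The probability of a reaction time between 250 ms and 350 ms is about 68%; this is the range within one standard deviation of the mean. The probability of a reaction time of more than 400 ms is about 2%, so responses that slow are rare.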
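
The same probabilities can be obtained by first converting each cutoff to a z-score, z = (x - mean) / sd, and then using the standard normal distribution. The sketch below is one way to do this; the variable names are chosen here for illustration, and lower.tail = FALSE is used as an alternative to subtracting from 1:

```r
mu <- 300
sigma <- 50

# Convert each cutoff to a z-score
z_250 <- (250 - mu) / sigma  # -1
z_350 <- (350 - mu) / sigma  #  1
z_400 <- (400 - mu) / sigma  #  2

p_less_250 <- pnorm(z_250)                      # P(X < 250)
p_between <- pnorm(z_350) - pnorm(z_250)        # P(250 < X < 350)
p_more_400 <- pnorm(z_400, lower.tail = FALSE)  # P(X > 400)

print(round(c(p_less_250, p_between, p_more_400), 4))
```

The cutoffs sit exactly 1, 1, and 2 standard deviations from the mean, which is why the results match the familiar 16%, 68%, and roughly 2% figures.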

Exercise 3: Applying the T-Distribution

  • Task: You conducted a small-scale study with 8 participants measuring their anxiety levels on a scale of 1 to 10. Calculate the probability of a t-score being less than 2 and between -1 and 1 using the t-distribution.

  • Instructions:

    1. Define the degrees of freedom for your study.

    2. Use the pt function in R to calculate these probabilities.

    3. Write the R code for these calculations and discuss how the results might differ if a normal distribution were assumed.

# Degrees of freedom
df <- 7  # for n = 8, df = n - 1
# Probability of a t-score less than 2
pt(2, df)
## [1] 0.9571903
# Probability of a t-score between -1 and 1
pt(1, df) - pt(-1, df)
## [1] 0.6493833

The probability of a t-score less than 2 is about 96%. The probability of a t-score between -1 and 1 is about 65%.
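
To see how the results would differ if a normal distribution were assumed, the sketch below puts the t-distribution and standard normal probabilities side by side. The data frame and column names are chosen here for illustration:

```r
df <- 7  # degrees of freedom for n = 8

comparison <- data.frame(
  event  = c("T < 2", "-1 < T < 1"),
  t_dist = c(pt(2, df), pt(1, df) - pt(-1, df)),  # t-distribution with 7 df
  normal = c(pnorm(2), pnorm(1) - pnorm(-1))      # standard normal
)

print(comparison)
```

The normal distribution gives about 0.977 and 0.683, compared with 0.957 and 0.649 for the t-distribution. With only 7 degrees of freedom the t-distribution has heavier tails, so it puts less probability near the center; assuming normality would understate how likely extreme scores are in a small sample.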

Submission Instructions:

Be sure to knit your document to PDF and check that all content displays correctly before submitting. Submit the PDF to Canvas Assignments.