Weekly Lab Homework Assignment: Descriptive Statistics and Probability

Objective:

This lab assignment aims to reinforce your understanding of descriptive statistics, calculating probabilities, and identifying sample spaces. You will apply these concepts to practical problems using R.

Homework Exercises:

Exercise 1: Analyzing Descriptive Statistics

Task: Given a dataset of reaction times, calculate the mean, median, mode, variance, standard deviation, and identify any outliers.
Dataset: Use the following reaction times (in milliseconds) for the analysis: c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290).

# Sample data vector
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

# Calculate mean
mean(reaction_times)

## [1] 311.75

# Calculate median
median(reaction_times)

## [1] 302.5

# Calculate mode
get_mode <- function(x) {
  uniqv <- unique(x)
  uniqv[which.max(tabulate(match(x, uniqv)))]
}

get_mode(reaction_times)

## [1] 295
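
Note that get_mode() returns only the first value that reaches the highest count. In this dataset three values (285, 290, and 295) each occur twice, so the data are multimodal. As a minimal sketch, a helper that returns every tied mode could look like the following; the name get_all_modes is hypothetical and not part of the original assignment.

```r
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

# Hypothetical helper (not in the original lab):
# return all values tied for the highest count
get_all_modes <- function(x) {
  counts <- table(x)  # frequency of each distinct value, sorted by value
  as.numeric(names(counts)[counts == max(counts)])
}

get_all_modes(reaction_times)
## [1] 285 290 295
```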

# Calculate variance
var(reaction_times)

## [1] 1113.882

# Calculate standard deviation
sd(reaction_times)

## [1] 33.37486

The variance is 1113.88. This value is in squared units (ms²), which makes it hard to interpret directly. Taking its square root gives the standard deviation, 33.37 ms, which is in the same units as the data. This means that a typical reaction time lies about 33 ms above or below the mean (312 ms), so the majority of reaction times fall between roughly 278 ms and 345 ms.
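
To make the link between these quantities concrete, the sketch below recomputes the sample variance by hand (sum of squared deviations divided by n - 1) and counts how many observations fall within one standard deviation of the mean. The variable names are illustrative only.

```r
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

n <- length(reaction_times)
m <- mean(reaction_times)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((reaction_times - m)^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# These should agree with the built-in functions
all.equal(manual_var, var(reaction_times))
all.equal(manual_sd, sd(reaction_times))

# Proportion of observations within one standard deviation of the mean
within_one_sd <- abs(reaction_times - m) <= manual_sd
mean(within_one_sd)
```

For this dataset, 14 of the 20 reaction times (70%) fall within one standard deviation of the mean.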

# Adding an outlier
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)

# Calculate quartiles and other statistics
Q1 <- quantile(reaction_times_outlier, 0.25)
print(paste("Quartile 1:", Q1))

## [1] "Quartile 1: 290"

Q3 <- quantile(reaction_times_outlier, 0.75)
print(paste("Quartile 3:", Q3))

## [1] "Quartile 3: 340"

IQR_value <- IQR(reaction_times_outlier)
print(paste("Quartile 3 - Quartile 1:", IQR_value))

## [1] "Quartile 3 - Quartile 1: 50"

median_val <- median(reaction_times_outlier)
print(paste("Median:", median_val))

## [1] "Median: 305"

What These Formulas Do

The outlier bounds are computed as lower bound = Q1 - 1.5 × IQR and upper bound = Q3 + 1.5 × IQR. These formulas establish the boundaries for identifying outliers in a dataset using what statisticians call the “1.5 × IQR rule.” Any data points that fall below the lower bound or above the upper bound are considered outliers.

Breaking It Down

Starting with the Box: The box in a boxplot represents the middle 50% of your data (from Q1 to Q3).

Extending Beyond the Box: We want to determine how far beyond the box a value can go before we consider it unusual.

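The Multiplier (1.5): The factor 1.5 is a statistical convention that sets the maximum length of the “whiskers” on the boxplot.

Each whisker extends from its quartile to the most extreme data point that lies within 1.5 times the IQR of that quartile. This creates a reasonable range for “normal” data; points beyond it are drawn individually as outliers.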

Why 1.5?

The choice of 1.5 as the multiplier is based on statistical theory and practical experience:

Statistical Properties: Under a normal distribution, approximately 99.3% of the data falls within these bounds. This means only about 0.7% of values would be flagged as outliers in normally distributed data.
Balance: The value 1.5 provides a good balance between:
- Being too sensitive (flagging too many points as outliers)
- Being too lenient (missing actual outliers)
Historical Precedent: John Tukey, who introduced the boxplot, proposed 1.5 as a practical rule of thumb that works well across many different types of data.
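
The 99.3% figure can be checked directly. The sketch below applies the 1.5 × IQR rule to the theoretical quartiles of a standard normal distribution; the variable names are illustrative.

```r
# Theoretical quartiles of the standard normal distribution
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1

# Fences from the 1.5 x IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Proportion of a normal distribution that lies inside the fences
coverage <- pnorm(upper_fence) - pnorm(lower_fence)
round(coverage, 3)
## [1] 0.993
```

The fences sit at roughly 2.7 standard deviations on either side of the mean, which is why so little of a normal distribution falls outside them.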

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

print(paste("Lower Bound:", lower_bound))

## [1] "Lower Bound: 215"

print(paste("Upper Bound:", upper_bound))

## [1] "Upper Bound: 415"

# Create a boxplot
boxplot(reaction_times_outlier, 
        main = "Reaction Times with Labeled Quantiles",
        ylab = "Time (ms)",
        ylim = c(min(reaction_times_outlier, lower_bound) - 50, max(reaction_times_outlier, upper_bound) + 50),
        outline = TRUE,
        col = "lightblue")

# Add horizontal dashed lines at Q1 and Q3 that extend across the plot
abline(h = Q1, col = "darkblue", lty = 2)
abline(h = Q3, col = "darkblue", lty = 2)

# Add annotations for quantiles
text(x = 1.3, y = Q1, labels = paste("Q1 =", Q1), pos = 4, col = "darkblue")
text(x = 1.2, y = median_val, labels = paste("Median =", median_val), pos = 4, col = "darkgreen")
text(x = 1.3, y = Q3, labels = paste("Q3 =", Q3), pos = 4, col = "darkblue")
text(x = 1.2, y = lower_bound, labels = paste("Lower bound =", round(lower_bound, 1)), pos = 4, col = "red")
text(x = 1.2, y = upper_bound, labels = paste("Upper bound =", round(upper_bound, 1)), pos = 4, col = "red")

# Add a line for the IQR within the box
segments(0.8, Q1, 0.8, Q3, col = "purple", lwd = 3)
text(x = 0.6, y = (Q1 + Q3) / 2, labels = paste("IQR =", IQR_value), pos = 2, col = "purple")

# Highlight outliers
outliers <- reaction_times_outlier[reaction_times_outlier > upper_bound | reaction_times_outlier < lower_bound]
if(length(outliers) > 0) {
  text(x = 1, y = outliers, labels = "Outlier", pos = 4, col = "red", cex = 0.9)
}

# Connect the annotation texts for Q1 and Q3 with the boxplot using line segments
# For Q1
segments(x0 = 1.05, y0 = Q1, x1 = 1, y1 = Q1, col = "darkblue", lwd = 2, lty = 2)
# For Q3
segments(x0 = 1.05, y0 = Q3, x1 = 1, y1 = Q3, col = "darkblue", lwd = 2, lty = 2)

reaction_times_outlier[reaction_times_outlier < lower_bound | reaction_times_outlier > upper_bound]

## [1] 900

Quartiles are useful for identifying outliers because they split the data into four equal-sized “chunks”. Using Q1, Q3, and the 1.5 × IQR rule, we can calculate upper and lower bounds (here 215 ms and 415 ms). This lets us identify values that are either extremely high or extremely low. In this example, the only outlier was 900 ms.
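
As a cross-check, base R's boxplot.stats() applies the same 1.5 × IQR convention (its coef argument defaults to 1.5) and reports the flagged points in its out element. It uses Tukey's hinges rather than quantile(), which can differ slightly from Q1 and Q3 for some sample sizes; for this dataset they coincide. A minimal sketch, with bp as an illustrative variable name:

```r
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                            295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)

bp <- boxplot.stats(reaction_times_outlier)  # coef = 1.5 by default

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
## [1] 250 290 305 340 370

# Points beyond the whiskers
bp$out
## [1] 900
```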

Exercise 2: Calculating Probabilities with the Normal Distribution

Task: Assume reaction times in a cognitive task follow a normal distribution with a mean of 300 ms and a standard deviation of 50 ms. Calculate the probability that a randomly selected individual has a reaction time:
1. Less than 250 ms.
2. Between 250 ms and 350 ms.
3. More than 400 ms.

# Parameters (named mu and sigma so they do not mask R's mean() and sd() functions)
mu <- 300
sigma <- 50

# Probability of a reaction time less than 250 ms
pnorm(250, mu, sigma)

## [1] 0.1586553

# Probability of a reaction time between 250 ms and 350 ms
pnorm(350, mu, sigma) - pnorm(250, mu, sigma)

## [1] 0.6826895

# Probability of a reaction time more than 400 ms
1 - pnorm(400, mu, sigma)

## [1] 0.02275013

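The probability of a reaction time less than 250 ms is about 16%. The probability of a reaction time between 250 ms and 350 ms is about 68%, which matches the rule that roughly 68% of a normal distribution lies within one standard deviation of the mean. The probability of a reaction time greater than 400 ms is about 2%.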
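
The same probabilities can be obtained by standardizing each cutoff to a z-score, z = (x - mean) / sd, and using the standard normal distribution. The sketch below uses the names mu and sigma for the parameters, an illustrative choice that avoids masking R's mean() and sd() functions.

```r
mu <- 300
sigma <- 50

# Convert the cutoffs (250, 350, 400 ms) to z-scores: -1, 1, and 2
z <- (c(250, 350, 400) - mu) / sigma

p_below_250 <- pnorm(z[1])                      # P(Z < -1)
p_between   <- pnorm(z[2]) - pnorm(z[1])        # P(-1 < Z < 1)
p_above_400 <- pnorm(z[3], lower.tail = FALSE)  # P(Z > 2)

round(c(p_below_250, p_between, p_above_400), 4)
## [1] 0.1587 0.6827 0.0228
```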

Exercise 3: Applying the T-Distribution

Task: You conducted a small-scale study with 8 participants measuring their anxiety levels on a scale of 1 to 10. Using the t-distribution, calculate the probability of a t-score being (1) less than 2 and (2) between -1 and 1.

# Degrees of freedom
df <- 7  # for n = 8, df = n - 1

# Probability of a t-score less than 2
pt(2, df)

## [1] 0.9571903

# Probability of a t-score between -1 and 1
pt(1, df) - pt(-1, df)

## [1] 0.6493833

With 7 degrees of freedom, the probability of a t-score less than 2 is about 96%. The probability of a t-score between -1 and 1 is about 65%.
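
For comparison, the sketch below computes the same probabilities under the standard normal distribution and under t-distributions with more degrees of freedom. With only 7 degrees of freedom the t-distribution has heavier tails, so less probability lies below 2 or between -1 and 1 than under the normal curve. The larger degrees-of-freedom values (30 and 1000) are arbitrary choices for illustration.

```r
dfs <- c(7, 30, 1000)

# P(T < 2) for each degrees-of-freedom value, and the normal counterpart
p_less_2_t <- pt(2, dfs)
p_less_2_z <- pnorm(2)

# P(-1 < T < 1) for each degrees-of-freedom value, and the normal counterpart
p_mid_t <- pt(1, dfs) - pt(-1, dfs)
p_mid_z <- pnorm(1) - pnorm(-1)

# Columns: df = 7, df = 30, df = 1000, standard normal
round(rbind(less_than_2 = c(p_less_2_t, p_less_2_z),
            between_pm1 = c(p_mid_t, p_mid_z)), 4)
```

As the degrees of freedom grow, the t-distribution probabilities move toward the normal values (about 0.977 and 0.683).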