This lab assignment aims to reinforce your understanding of descriptive statistics, calculating probabilities, and identifying sample spaces. You will apply these concepts to practical problems using R.
Task: Given a dataset of reaction times, calculate the mean, median, mode, variance, standard deviation, and identify any outliers.
Dataset: Use the following reaction times (in milliseconds) for the analysis: c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290).
# Sample data vector
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290)
# Calculate mean
mean(reaction_times)
## [1] 311.75
# Calculate median
median(reaction_times)
## [1] 302.5
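# Calculate mode (base R has no built-in function for the statistical mode)
get_mode <- function(x) {
  uniqv <- unique(x)
  # which.max() returns the first maximum, so ties go to the value that appears first in x
  uniqv[which.max(tabulate(match(x, uniqv)))]
}
get_mode(reaction_times)
## [1] 295
# Calculate variance
var(reaction_times)
## [1] 1113.882
# Calculate standard deviation
sd(reaction_times)
## [1] 33.37486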
The variance is 1113.88. This value is in squared units (ms²) and is difficult to interpret as is. Taking its square root gives the standard deviation, 33.37 ms. This means that a typical reaction time lies within about 33 ms of the mean (312 ms), either below or above it. The majority of reaction times fall between roughly 278 ms and 345 ms.
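As a quick check on the claim that most observations lie within one standard deviation of the mean, the sketch below counts them directly. The names `m`, `s`, and `within_one_sd` are introduced here for illustration only and are not part of the original lab code.

```r
# Reaction times from the lab dataset (ms)
reaction_times <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                    295, 285, 360, 370, 275, 325, 335, 350, 280, 290)

m <- mean(reaction_times)  # sample mean
s <- sd(reaction_times)    # sample standard deviation

# TRUE for observations that lie within one standard deviation of the mean
within_one_sd <- reaction_times >= m - s & reaction_times <= m + s

sum(within_one_sd)   # number of observations inside the interval
mean(within_one_sd)  # proportion of observations inside the interval
```

For this dataset, 14 of the 20 observations (70%) fall inside the interval, which is in line with the roughly 68% expected for normally distributed data.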
# Adding an outlier
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290, 295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)
# Calculate quartiles and other statistics
Q1 <- quantile(reaction_times_outlier, 0.25)
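print(paste("Quantile 1:", Q1))
## [1] "Quantile 1: 290"
Q3 <- quantile(reaction_times_outlier, 0.75)
print(paste("Quantile 3:", Q3))
## [1] "Quantile 3: 340"
IQR_value <- Q3 - Q1
print(paste("Quantile 3 - Quantile 1:", IQR_value))
## [1] "Quantile 3 - Quantile 1: 50"
median_val <- median(reaction_times_outlier)
print(paste("Median:", median_val))
## [1] "Median: 305"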
The lower bound, Q1 − 1.5 × IQR, and the upper bound, Q3 + 1.5 × IQR, mark the limits for identifying outliers under what statisticians call the “1.5 × IQR rule.” Any data point that falls below the lower bound or above the upper bound is considered an outlier.
Starting with the Box: The box in a boxplot represents the middle 50% of your data (from Q1 to Q3).
Extending Beyond the Box: We want to determine how far beyond the box a value can go before we consider it unusual.
The Multiplier (1.5): The factor 1.5 is a statistical convention that sets the maximum length of the “whiskers” on the boxplot.
Each whisker extends from its quartile to the most extreme data point that lies within 1.5 times the IQR. This creates a reasonable range for “normal” data.
The choice of 1.5 as the multiplier is based on statistical theory and practical experience:
Statistical Properties: Under a normal distribution, approximately 99.3% of the data falls within these bounds. This means only about 0.7% of values would be flagged as outliers in normally distributed data.
Balance: The value 1.5 provides a good balance between:
Being too sensitive (flagging too many points as outliers)
Being too lenient (missing actual outliers)
Historical Precedent: John Tukey, who developed the boxplot, found through empirical research that 1.5 worked well across many different types of data.
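The 99.3% figure quoted above can be checked for a standard normal distribution with `qnorm()` and `pnorm()`. This is a minimal sketch, and the variable names are illustrative.

```r
# Quartiles of the standard normal distribution
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1

# Bounds given by the 1.5 x IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Proportion of a normal distribution that lies inside the bounds
inside <- pnorm(upper) - pnorm(lower)
round(inside, 3)      # proportion inside the bounds
round(1 - inside, 3)  # proportion that would be flagged as outliers
```

The bounds sit at about ±2.7 standard deviations, which is why only about 0.7% of normally distributed values are flagged.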
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
print(paste("Lower Bound:", lower_bound))
## [1] "Lower Bound: 215"
print(paste("Upper Bound:", upper_bound))
## [1] "Upper Bound: 415"
# Create a boxplot
boxplot(reaction_times_outlier,
        main = "Reaction Times with Labeled Quantiles",
        ylab = "Time (ms)",
        ylim = c(min(reaction_times_outlier, lower_bound) - 50, max(reaction_times_outlier, upper_bound) + 50),
        outline = TRUE,
        col = "lightblue")
# Add horizontal dashed lines for Q1 and Q3 to connect these lines across the box
abline(h = Q1, col = "darkblue", lty = 2)
abline(h = Q3, col = "darkblue", lty = 2)
# Add annotations for quantiles
text(x = 1.3, y = Q1, labels = paste("Q1 =", Q1), pos = 4, col = "darkblue")
text(x = 1.2, y = median_val, labels = paste("Median =", median_val), pos = 4, col = "darkgreen")
text(x = 1.3, y = Q3, labels = paste("Q3 =", Q3), pos = 4, col = "darkblue")
text(x = 1.2, y = lower_bound, labels = paste("Lower bound =", round(lower_bound, 1)), pos = 4, col = "red")
text(x = 1.2, y = upper_bound, labels = paste("Upper bound =", round(upper_bound, 1)), pos = 4, col = "red")
# Add a line for the IQR within the box
segments(0.8, Q1, 0.8, Q3, col = "purple", lwd = 3)
text(x = 0.6, y = (Q1 + Q3) / 2, labels = paste("IQR =", IQR_value), pos = 2, col = "purple")
# Highlight outliers
outliers <- reaction_times_outlier[reaction_times_outlier > upper_bound | reaction_times_outlier < lower_bound]
if(length(outliers) > 0) {
text(x = 1, y = outliers, labels = paste("Outlier"), pos = 4, col = "red", cex = 0.9)
}
# Connect the annotation texts for Q1 and Q3 with the boxplot using line segments
# For Q1
segments(x0 = 1.05, y0 = Q1, x1 = 1, y1 = Q1, col = "darkblue", lwd = 2, lty = 2)
# For Q3
segments(x0 = 1.05, y0 = Q3, x1 = 1, y1 = Q3, col = "darkblue", lwd = 2, lty = 2)
# Print the values flagged as outliers
outliers
## [1] 900
Quartiles are useful for identifying outliers because they split the data into four equal-sized “chunks” and are themselves barely affected by extreme values. Using Q1, Q3, and the 1.5 × IQR rule, we can calculate the lower and upper bounds. This lets us identify values that are either extremely low or extremely high. In this example, the only outlier is 900 ms, which lies far above the upper bound of 415 ms.
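The outlier check can also be wrapped in a small reusable function. The sketch below assumes a hypothetical helper named `find_iqr_outliers()` (not part of the original lab) and cross-checks it against base R's `boxplot.stats()`. Note that `boxplot.stats()` uses Tukey's hinges rather than `quantile()`, so the two approaches can differ slightly for some sample sizes, although they agree here.

```r
# Hypothetical helper: returns the values of x outside the k x IQR bounds
find_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]
  x[x < q[1] - k * iqr | x > q[2] + k * iqr]
}

# Reaction times with the added outlier (ms)
reaction_times_outlier <- c(250, 340, 295, 305, 285, 330, 365, 300, 310, 290,
                            295, 285, 360, 370, 275, 325, 335, 350, 280, 290, 900)

manual_outliers <- find_iqr_outliers(reaction_times_outlier)
manual_outliers

# Cross-check with the values that boxplot() itself would draw as outliers
builtin_outliers <- boxplot.stats(reaction_times_outlier)$out
builtin_outliers
```

Both calls return 900 for this dataset.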
Task: Assume reaction times in a cognitive task follow a normal distribution with a mean of 300 ms and a standard deviation of 50 ms. Calculate the probability that a randomly selected individual has a reaction time:
Less than 250 ms.
Between 250 ms and 350 ms.
More than 400 ms.
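# Parameters of the assumed normal distribution (ms)
mean_rt <- 300
sd_rt <- 50
# Probability of a reaction time less than 250 ms
pnorm(250, mean_rt, sd_rt)
## [1] 0.1586553
# Probability of a reaction time between 250 ms and 350 ms
pnorm(350, mean_rt, sd_rt) - pnorm(250, mean_rt, sd_rt)
## [1] 0.6826895
# Probability of a reaction time more than 400 ms
1 - pnorm(400, mean_rt, sd_rt)
## [1] 0.02275013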
The probability of a reaction time less than 250 ms is about 16%. The probability of a reaction time between 250 ms and 350 ms is about 68%. The probability of a reaction time greater than 400 ms is about 2%.
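These probabilities can also be reasoned about through z-scores: each cutoff is converted to the number of standard deviations it lies from the mean, and the standard normal distribution is used from there. The sketch below is illustrative, and the names `mu`, `sigma`, and `z` are assumptions introduced for this example.

```r
mu <- 300    # mean of the assumed normal distribution (ms)
sigma <- 50  # standard deviation of the assumed normal distribution (ms)

# Convert each cutoff (250, 350, 400 ms) to a z-score
z <- (c(250, 350, 400) - mu) / sigma
z  # -1, 1, 2

# Probabilities from the standard normal distribution
p_less_250 <- pnorm(z[1])
p_between  <- pnorm(z[2]) - pnorm(z[1])
p_more_400 <- pnorm(z[3], lower.tail = FALSE)

round(c(p_less_250, p_between, p_more_400), 4)
```

The rounded results are 0.1587, 0.6827, and 0.0228, matching the familiar rule that about 68% of a normal distribution lies within one standard deviation of the mean.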
## [1] 0.9571903
## [1] 0.6493833
The probability of a value less than 2 is about 96%. The probability of a value between -1 and 1 is about 65%.