In computer science, especially at the postgraduate level, research goes beyond building systems. It must involve evaluating, validating, and communicating those systems’ effectiveness using data.
This is where statistics becomes essential.
Statistics provides the language of evidence in science. It transforms raw performance numbers into scientifically defensible claims.
In computer science research, especially at the postgraduate level, an observation alone is not evidence. Without statistical analysis, a claim such as:
“I ran my new sorting algorithm and it seemed faster than QuickSort.”
is anecdotal at best. It lacks the rigor required for scientific validation.
In a thesis defense, such a statement will immediately prompt scrutiny. Examiners are trained to ask how many runs you performed, which inputs you used, how variable the measurements were, and whether the observed difference could have arisen by chance.
These are not simply technical questions; they are statistical questions. Without a solid understanding of statistical tools, it is difficult to answer them convincingly.
In short, statistics is what transforms a set of observations into a defensible research finding. It allows you to quantify uncertainty, test assumptions, justify claims, and communicate results clearly — all of which are essential for academic credibility.
In computer science research, statistical methods are used for:
Research Activity | Statistical Role |
---|---|
Algorithm performance comparison | t-tests, ANOVA, boxplots |
Classifier accuracy evaluation | Confidence intervals, hypothesis testing |
Survey analysis | Chi-squared tests, proportions, ranking |
Input/output performance modeling | Regression, correlation |
Software/system reliability | Error rate analysis, standard deviation |
To make the statistical concepts in this lecture more tangible and relevant to your experience as a postgraduate student, we will return repeatedly to a single research scenario — one that resembles the kind of problem you might actually face in your thesis.
You are defending your MSc thesis in Computer Science.
Your research investigates the performance of two well-known sorting algorithms:
MergeSort and QuickSort.
You claim that MergeSort is more efficient, and you want to prove this claim based on experimental data.
To support your hypothesis, you design a controlled experiment in which each algorithm is run repeatedly on comparable inputs and its execution time is recorded.
This results in a dataset containing 60 numerical measurements, 30 from each algorithm, which will form the basis for all the statistical techniques explored in this course.
To demonstrate statistical concepts in this lecture, we will simulate a controlled experiment involving two sorting algorithms: MergeSort and QuickSort.
You are a postgraduate student evaluating the performance of these algorithms for your thesis. You want to answer the research question:
Is MergeSort significantly faster than QuickSort on average?
To explore this, you conduct the following experiment: you run each algorithm 30 times on randomly generated input arrays and record the execution time of every run in milliseconds.
These measurements will form the basis of our statistical analysis throughout the lecture.
We use R's random number generator (rnorm()) to simulate realistic, normally distributed data with slight differences in mean and variability between the two algorithms.
Run | QuickSort (ms) | MergeSort (ms) |
---|---|---|
1 | 60.23 | 50.28 |
2 | 48.61 | 51.52 |
3 | 54.18 | 53.18 |
4 | 55.80 | 44.96 |
5 | 54.43 | 50.52 |
6 | 51.36 | 39.41 |
7 | 61.07 | 44.08 |
8 | 51.43 | 43.75 |
9 | 64.11 | 35.93 |
10 | 51.62 | 48.18 |
11 | 59.83 | 49.03 |
12 | 65.72 | 46.19 |
13 | 43.67 | 51.79 |
14 | 50.33 | 44.37 |
15 | 51.20 | 41.16 |
16 | 55.82 | 50.16 |
17 | 50.29 | 43.94 |
18 | 36.06 | 55.22 |
19 | 37.36 | 45.84 |
20 | 59.92 | 51.28 |
Note on Data Format: While the table above shows the data in a wide format (QuickSort and MergeSort side-by-side) for clarity, our actual dataset is stored in long format.
This allows us to easily calculate group-wise statistics, create plots, and run statistical tests with tools such as t.test(), aggregate(), and the ggplot2 package, all of which we will meet later in this lecture. So the data actually look like this:
print(head(algo_data))
## Algorithm Time_ms
## 1 QuickSort 60.22575
## 2 QuickSort 48.61181
## 3 QuickSort 54.17877
## 4 QuickSort 55.79718
## 5 QuickSort 54.42561
## 6 QuickSort 51.36325
This dataset contains:
- A column Algorithm indicating which sorting algorithm was used
- A column Time_ms representing the measured execution time for that run
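For reference, here is a minimal sketch of how such a long-format dataset could be constructed. The seed and the distribution parameters below are assumptions for illustration, not the values used to generate the lecture data, so re-running it will not reproduce the numbers shown above.

set.seed(42)                              # assumed seed, not the lecture's
qs_times <- rnorm(30, mean = 52, sd = 7)  # simulated QuickSort run-times (ms), assumed parameters
ms_times <- rnorm(30, mean = 48, sd = 5)  # simulated MergeSort run-times (ms), assumed parameters
algo_data <- data.frame(
  Algorithm = rep(c("QuickSort", "MergeSort"), each = 30),
  Time_ms = c(qs_times, ms_times)
)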
You now have a clean and realistic dataset with which to explore:
- Measures of central tendency and spread
- Visualizations
- Assumption checking
- Confidence intervals
- Hypothesis testing
The central question you aim to answer is:
“Is MergeSort significantly faster than QuickSort?”
To answer this convincingly, you cannot rely on visual inspection or gut feeling. You must summarize the data, check distributional assumptions, quantify uncertainty, and apply an appropriate hypothesis test.
In other words, your thesis defense becomes a statistical investigation — one in which every concept we learn will help you build a scientifically credible argument.
By the end of this crash course, you will be able to summarize and visualize experimental data, check distributional assumptions, construct confidence intervals, and choose and run the hypothesis test that fits your research question.
When analyzing data, one of the first questions we often ask is:
“Where is the center of the data?”
Central tendency refers to the concept of a typical or central value around which data tend to cluster. It is a foundational idea in statistics, providing a numerical summary of an entire dataset using a single value.
The three most common measures of central tendency are the mean, the median, and the mode.
Each has strengths and weaknesses depending on the shape and nature of the data.
The mean is the sum of all values divided by the number of values:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
It is widely used due to its simplicity and because many statistical techniques are based on it. However, the mean is sensitive to outliers. A single extreme value can distort the result significantly.
Example: If most run-times are around 50ms, but one test took 200ms due to a system hiccup, the mean will be pulled upward.
The median is the middle value when all observations are ordered. If the number of observations is even, it is the average of the two middle values.
It is not affected by outliers, which makes it a more robust indicator of central tendency in skewed distributions.
Example: In the presence of an unusually long run-time in a generally stable set of results, the median may reflect the “typical” behavior more faithfully than the mean.
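A quick sketch makes the contrast concrete; the run-times below are hypothetical, with one 200 ms hiccup among otherwise typical runs of about 50 ms.

times <- c(49, 50, 51, 52, 50, 200)  # hypothetical run-times (ms) with one outlier
mean(times)    # pulled upward by the 200 ms value (about 75.3)
median(times)  # stays near the typical run-time (50.5)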
The mode is the value that appears most frequently in a dataset.
Let’s use our experimental dataset comparing QuickSort and MergeSort execution times.
mean(algo_data$Time_ms) # Overall mean
## [1] 49.90099
median(algo_data$Time_ms) # Overall median
## [1] 50.28587
# Mode is not defined directly for continuous data, but we can approximate:
table(round(algo_data$Time_ms)) # Frequency table of rounded values
##
## 33 36 37 39 41 44 45 46 48 49 50 51 52 53 54 55 56 59 60 61 63 64 66
## 1 2 1 1 3 6 1 2 4 5 7 8 3 1 2 2 3 1 3 1 1 1 1
When preparing your thesis:
- Use the mean if your data are normally distributed and free of outliers.
- Use the median if your data are skewed or contain outliers.
- Use the mode only for categorical or ordinal variables.
Try to pair numerical summaries with visualizations to get a fuller picture.
While central tendency tells us where the data are centered, it does not tell us how spread out the values are around that center. In research, especially in performance analysis, knowing whether values are consistently close to the mean or widely scattered is crucial.
Variability measures help us quantify how much uncertainty or inconsistency exists in our data.
In this section, we will introduce four core concepts: the range, deviations from the mean, the variance, and the standard deviation.
Each concept builds upon the previous, leading to a deeper understanding of the data’s behavior.
The range is the simplest measure of spread:
\[ \text{Range} = \text{Maximum} - \text{Minimum} \]
It gives the full span of values but is very sensitive to outliers, since it depends only on the two extreme points.
range(algo_data$Time_ms)
## [1] 33.03455 65.71987
diff(range(algo_data$Time_ms)) # Actual range value
## [1] 32.68532
This tells us the total spread of execution times across both algorithms. However, it does not tell us how values are distributed within this range.
To get a deeper sense of spread, we consider how far each observation is from the mean:
\[ x_i - \bar{x} \]
Where:
- \(x_i\) is each value
- \(\bar{x}\) is the sample mean
These are called deviations. They reflect the distance between each value and the center of the distribution.
However, if we add all the deviations, they always sum to zero, because the positive and negative deviations around the mean cancel out by construction:
mean(algo_data$Time_ms)
## [1] 49.90099
sum(algo_data$Time_ms - mean(algo_data$Time_ms)) # Always zero
## [1] -6.394885e-14
This leads us to square the deviations to avoid cancellation — and that gives us the variance.
Variance is the average of the squared deviations from the mean:
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
tapply(algo_data$Time_ms, algo_data$Algorithm, var)
## MergeSort QuickSort
## 27.56546 56.70344
Interpretation: QuickSort's variance (about 56.7 ms²) is roughly double MergeSort's (about 27.6 ms²), so QuickSort's run-times are noticeably less consistent. Variance gives us a numerical sense of spread, but its unit is squared (e.g., ms²), making it harder to interpret directly.
The standard deviation is simply the square root of the variance:
\[ s = \sqrt{s^2} \]
It expresses variability in the same units as the original data, making it far more interpretable.
tapply(algo_data$Time_ms, algo_data$Algorithm, sd)
## MergeSort QuickSort
## 5.250282 7.530168
Interpretation: the standard deviation tells us, on average, how far values deviate from the mean, in the original units. Here MergeSort (SD ≈ 5.25 ms) is more consistent than QuickSort (SD ≈ 7.53 ms). A higher SD indicates greater variability (i.e., less consistency), while a lower SD implies tighter clustering.
Let’s visually compare two distributions that have the same mean, but different spreads.
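The original plot is not reproduced here, so the following sketch draws two normal curves with the same mean (50) but different standard deviations; the specific values are illustrative assumptions.

x <- seq(30, 70, length.out = 200)
plot(x, dnorm(x, mean = 50, sd = 2), type = "l", col = "blue", lwd = 2,
     xlab = "Value", ylab = "Density", main = "Same Mean, Different Spread")
lines(x, dnorm(x, mean = 50, sd = 6), col = "red", lwd = 2)
legend("topright", legend = c("SD = 2", "SD = 6"), col = c("blue", "red"), lwd = 2)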
This illustrates:
- Higher SD leads to a wider, flatter distribution
- Lower SD results in a narrower, more peaked shape
Even with the same average, variability changes how reliable and predictable the data are.
While numerical summaries (like mean and standard deviation) are useful, they can sometimes hide important details about how data behave. Visualizing the distribution of your data allows you to:
In this section, we will explore the two most important tools for visualizing distributions:
- Histograms
- Boxplots
We will also introduce supporting concepts: quartiles, interquartile range (IQR), and outlier detection.
A histogram shows how data are distributed across intervals (called bins). It answers the question:
“How many values fall within each range of the variable?”
Each bar represents a range of values, and its height indicates how many observations fall within that range.
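As a quick illustration (assuming the algo_data data frame introduced earlier), a basic histogram of all execution times can be drawn like this:

hist(algo_data$Time_ms, breaks = 12, col = "lightblue",
     main = "Distribution of Execution Times", xlab = "Execution Time (ms)")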
Limitation: Histograms don’t handle group comparisons well — for that, we need boxplots.
Before we explore boxplots, we must understand quartiles, which divide an ordered dataset into four equal parts: Q1 (the 25th percentile), Q2 (the median, or 50th percentile), and Q3 (the 75th percentile).
The interquartile range (IQR) is defined as:
\[ \text{IQR} = Q_3 - Q_1 \]
It measures the middle 50% of the data, excluding extremes. It’s a robust alternative to standard deviation when data are skewed or contain outliers.
You can compute quartiles and IQR in R using:
summary(algo_data$Time_ms) # includes Q1, Median, Q3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.03 45.62 50.29 49.90 53.43 65.72
IQR(algo_data$Time_ms)
## [1] 7.805412
A boxplot (or box-and-whisker plot) is a compact visual summary of the distribution of a variable. It is built on the five-number summary: the minimum, Q1, the median, Q3, and the maximum.
Additionally, boxplots mark outliers as individual points if they fall too far from the central range.
boxplot(Time_ms ~ Algorithm,
data = algo_data,
col = c("lightblue", "lightgreen"),
main = "Boxplot: MergeSort vs QuickSort",
ylab = "Execution Time (ms)")
From the boxplot above, you can immediately tell which algorithm has the lower median execution time, which group is more spread out, and whether either group contains outliers or visible skew.
For instance, a longer upper whisker or a cluster of outliers on one side may indicate right skewness.
Outliers are values that lie far from the rest of the data. In boxplots, a value is typically considered an outlier if it is:
\[ \text{Less than } Q_1 - 1.5 \times IQR \quad \text{or} \quad \text{Greater than } Q_3 + 1.5 \times IQR \]
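As a sketch, the 1.5 × IQR rule can be applied directly to our execution-time data to flag unusual values:

q <- quantile(algo_data$Time_ms, c(0.25, 0.75))   # Q1 and Q3
iqr <- IQR(algo_data$Time_ms)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
algo_data$Time_ms[algo_data$Time_ms < lower | algo_data$Time_ms > upper]  # values flagged as outliers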
Outliers may represent:
- Data entry errors
- Exceptional cases (e.g., performance spikes)
- True variability
Skewness, on the other hand, refers to asymmetry in the distribution:
- Right-skewed: long tail to the right
- Left-skewed: long tail to the left
Both skewness and outliers affect the choice of summary statistics and influence which statistical tests are appropriate.
Purpose | Use Histogram | Use Boxplot |
---|---|---|
Understand overall shape | ✅ Yes | Limited |
Detect skewness and modality | ✅ Yes | Partially |
Compare multiple groups | ❌ Not easily | ✅ Very effective |
See outliers | ❌ Not directly | ✅ Clearly marked |
Visualize IQR and spread | ❌ No | ✅ Built-in |
In statistics, a distribution describes how the values of a variable are spread or arranged across the number line. It shows the frequency (or probability) with which different values occur in a dataset or population.
Understanding distributions is fundamental to all of statistical reasoning, because many methods — such as confidence intervals, hypothesis testing, and model fitting — depend not only on the values themselves, but also on how those values are distributed.
You can think of a distribution as a kind of statistical fingerprint for a variable. It tells us where values cluster, how spread out they are, whether the shape is symmetric or skewed, and whether extreme values are common or rare.
Distributions can be empirical (based on observed data) or theoretical (based on probability models, like the normal distribution).
Let’s look at some common types of distributions seen in real-world and simulated data:
Distribution Type | Description |
---|---|
Uniform | All values occur with equal frequency |
Normal (Gaussian) | Symmetrical bell curve centered around the mean |
Right-skewed | Long tail to the right (e.g., income, execution spikes) |
Left-skewed | Long tail to the left (e.g., time-to-failure in some systems) |
Bimodal | Two distinct peaks (e.g., mixture of two groups or processes) |
Multimodal | More than two peaks — complex or overlapping subgroups |
These differences are not cosmetic — they affect which summaries are meaningful and which tests are valid.
Distributions can also be classified by the type of data they represent:
- Discrete distributions deal with countable values. Examples: number of users, frequency of clicks, days active (e.g., Binomial, Poisson).
- Continuous distributions represent variables that can take on any value within a range. Examples: time, temperature, height, algorithm execution time (e.g., Normal, Exponential).
Understanding whether your variable is discrete or continuous affects:
- The choice of visualization (e.g., bar plot vs histogram; see the sketch below)
- The type of probability model to apply
- The interpretation of statistical results
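The sketch below contrasts the two cases with simulated data; the variables and parameters are illustrative assumptions.

set.seed(3)                               # assumed seed
clicks <- rpois(100, lambda = 4)          # discrete: clicks per user
runtime <- rnorm(100, mean = 50, sd = 5)  # continuous: execution time (ms)
par(mfrow = c(1, 2))
barplot(table(clicks), main = "Discrete: Clicks", xlab = "Clicks per user", ylab = "Count")
hist(runtime, col = "lightgray", main = "Continuous: Run-Time", xlab = "Time (ms)")
par(mfrow = c(1, 1))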
Many statistical methods are based on assumptions about the shape of the distribution: t-tests, ANOVA, and many confidence intervals, for example, assume approximate normality.
Violating these assumptions can lead to misleading conclusions.
Even more practically: distributional characteristics help you choose the right summary statistics, determine if your results are reliable, and communicate findings more effectively in your thesis or publication.
Once we understand what a distribution is, the next step is to describe its shape. This helps us decide which statistical summaries and tests are appropriate. Three of the most important shape-related features of a distribution are symmetry, modality, and skewness.
A distribution is symmetric if the left and right sides are roughly mirror images of each other when plotted.
Symmetry is an important assumption behind many parametric statistical tests, such as t-tests and ANOVA.
Modality refers to the number of “peaks” (modes) in a distribution.
Term | Meaning | Example Use Case |
---|---|---|
Unimodal | One peak | Most natural or well-modeled measurements |
Bimodal | Two distinct peaks | Two user populations, algorithm A vs B response |
Multimodal | More than two peaks | Overlapping subpopulations or processes |
Understanding modality can reveal the presence of mixture effects in your data — for example, if your performance data includes multiple system configurations, it may produce multiple peaks.
Skewness measures the asymmetry of a distribution.
A distribution is right-skewed (positively skewed) if it has a long tail on the right side (high values). \[ \text{Mean} > \text{Median} > \text{Mode} \]
A distribution is left-skewed (negatively skewed) if it has a long tail on the left side (low values). \[ \text{Mean} < \text{Median} < \text{Mode} \]
Skewness affects the relationship between the mean and the median, the choice of summary statistics, and the validity of tests that assume symmetry.
You can detect these features using histograms, density plots, or boxplots. Consider this example:
set.seed(101)
right_skewed <- c(rnorm(40, mean = 50, sd = 5), 90, 95, 100)
hist(right_skewed, col = "lightgray", breaks = 12,
main = "Right-Skewed Distribution", xlab = "Value", freq = FALSE)
lines(density(right_skewed), col = "blue", lwd = 2)
abline(v = mean(right_skewed), col = "red", lwd = 2, lty = 1)
abline(v = median(right_skewed), col = "darkgreen", lwd = 2, lty = 2)
legend("topright", legend = c("Mean", "Median"),
col = c("red", "darkgreen"), lty = 1:2, cex = 0.9)
Notice how the long right tail pulls the mean (solid red line) to the right of the median (dashed green line), which is exactly the signature of right skewness.
The normal distribution, also known as the Gaussian distribution, is one of the most important concepts in statistics. It serves as the foundation for many statistical methods, including t-tests, ANOVA, regression analysis, and confidence intervals.
Understanding its shape, properties, and implications is essential for interpreting statistical results correctly and deciding which methods are appropriate.
The normal distribution is a continuous probability distribution, fully determined by its mean \(\mu\) and standard deviation \(\sigma\), with probability density:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]
Key properties: the curve is bell-shaped and symmetric about the mean \(\mu\), and its spread is controlled entirely by the standard deviation \(\sigma\).
Because of its symmetry, the mean, median, and mode coincide, and half of the values lie on each side of the mean.
In graphical terms, the curve starts low, rises to a peak at the mean, and then falls symmetrically.
The empirical rule, also known as the 68–95–99.7 rule, is a convenient rule of thumb that describes how data are distributed around the mean in a normal distribution.
It tells us what percentage of values are expected to lie within one, two, or three standard deviations from the mean:
Range | Coverage (Approx.) | Description |
---|---|---|
Mean ± 1 SD | 68% of data | Most data are close to the center |
Mean ± 2 SD | 95% of data | Almost all data fall here |
Mean ± 3 SD | 99.7% of data | Extremely rare values fall beyond this |
Let's say your algorithm's run-time follows a normal distribution with, for example, a mean of 50 ms and a standard deviation of 5 ms.
Then roughly 68% of runs should fall between 45 and 55 ms, about 95% between 40 and 60 ms, and about 99.7% between 35 and 65 ms.
This rule gives us a baseline expectation for where most of our data should fall. If we observe values far outside these boundaries, they may be considered outliers, errors, or rare events worth investigating.
This rule is foundational for spotting outliers, setting performance thresholds, and understanding why the standard deviation is such a useful yardstick.
A bell-curve diagram of these bands illustrates how data cluster around the mean and shows why standard deviation is such a meaningful measure of spread in normal data.
The normal distribution plays a central role in statistics for several reasons:
- Theoretical basis: many random phenomena (e.g., biological measurements, measurement errors, response times) tend to approximate a normal distribution in large samples.
- Analytical convenience: the properties of the normal distribution allow for exact calculation of probabilities, critical values, and thresholds.
- Foundation for parametric methods: methods like the t-test, regression, and ANOVA assume (either explicitly or via the Central Limit Theorem) that the sampling distribution is approximately normal.
- CLT linkage: even if the raw data are not normal, the sampling distribution of the mean tends to be normal under certain conditions, a result known as the Central Limit Theorem, which we will discuss next.
Let's calculate the mean and standard deviation of execution times for each algorithm by subsetting the data frame directly.
# Mean and SD by group
qs_mean <- mean(algo_data$Time_ms[algo_data$Algorithm == "QuickSort"])
qs_sd <- sd(algo_data$Time_ms[algo_data$Algorithm == "QuickSort"])
ms_mean <- mean(algo_data$Time_ms[algo_data$Algorithm == "MergeSort"])
ms_sd <- sd(algo_data$Time_ms[algo_data$Algorithm == "MergeSort"])
# Print them nicely
cat("QuickSort: Mean =", round(qs_mean, 2), ", SD =", round(qs_sd, 2), "\n")
## QuickSort: Mean = 52.41 , SD = 7.53
cat("MergeSort: Mean =", round(ms_mean, 2), ", SD =", round(ms_sd, 2), "\n")
## MergeSort: Mean = 47.39 , SD = 5.25
Suppose the output were (actual numbers vary with the random generation):
MergeSort: Mean = 48.48 , SD = 4.53
Then, applying the empirical rule, about 68% of MergeSort runs should fall between 43.95 and 53.01 ms, about 95% between 39.42 and 57.54 ms, and about 99.7% between 34.89 and 62.07 ms.
Here is how to plot a histogram and overlay the standard deviation boundaries without using dplyr.
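The plotting code itself is not included in the source; a minimal base-R sketch using the group mean and SD computed above might look like this:

ms_vec <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
hist(ms_vec, breaks = 10, col = "lightgreen",
     main = "MergeSort Times with SD Boundaries", xlab = "Execution Time (ms)")
abline(v = ms_mean, col = "red", lwd = 2)                          # mean
abline(v = ms_mean + c(-1, 1) * ms_sd, col = "blue", lty = 2)      # mean ± 1 SD
abline(v = ms_mean + c(-2, 2) * ms_sd, col = "darkgray", lty = 3)  # mean ± 2 SD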
The Central Limit Theorem (CLT) is one of the most important and widely used results in all of statistics. It provides the mathematical justification for using methods like t-tests and confidence intervals even when the data do not perfectly follow a normal distribution.
It is often described as the bridge between real-world data, which can be messy or skewed, and idealized statistical methods, which often assume normality.
Suppose you repeatedly take samples from a population and compute the mean of each sample. The distribution of these means is called the sampling distribution of the sample mean.
The Central Limit Theorem says:
As the sample size increases, the sampling distribution of the sample mean becomes approximately normal, regardless of the shape of the original population.
More formally:
\[ \bar{x} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \]
Many statistical techniques — such as confidence intervals and t-tests — assume normality. But in real-world research (including algorithm performance tests), the data may be skewed, bimodal, or contain outliers.
The CLT tells us that, once the sample size is reasonably large (n ≈ 30 is a common rule of thumb), the sample mean behaves approximately normally even when the raw data do not, so mean-based methods remain approximately valid.
This is why CLT is especially important for postgraduate research: it justifies the methods you use, even when the data seem imperfect.
To see the CLT in action, we can simulate a non-normal population (e.g., a right-skewed exponential distribution), repeatedly sample from it, and look at the distribution of the sample means.
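Here is a minimal sketch of that simulation; the seed, population parameters, and sample sizes are assumptions chosen for illustration.

set.seed(7)                                     # assumed seed
population <- rexp(10000, rate = 1/50)          # right-skewed population with mean about 50
sample_means <- replicate(1000, mean(sample(population, size = 30)))
par(mfrow = c(1, 2))
hist(population, breaks = 40, col = "lightgray",
     main = "Skewed Population", xlab = "Value")
hist(sample_means, breaks = 30, col = "lightblue",
     main = "Sample Means (n = 30)", xlab = "Sample Mean")
par(mfrow = c(1, 1))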
You should notice that the population itself is strongly right-skewed, while the histogram of sample means is roughly symmetric and bell-shaped, centered near the population mean.
This illustrates the core idea: as sample size increases, the distribution of sample means becomes approximately normal, even when the population is not.
Statistical methods such as t-tests and confidence intervals rely on the assumption that the underlying data (or at least the sampling distribution of the mean) is approximately normal.
While formal tests exist (like the Shapiro-Wilk test), it is best practice to begin with graphical checks. They provide intuitive insight into the shape, symmetry, skewness, and potential outliers in the data.
We will now use actual data from our experiment (MergeSort vs QuickSort) to demonstrate three graphical checks: histograms, density plots, and QQ plots.
These will help answer the question:
Can we assume normality for each algorithm's execution times?
A histogram shows the distribution of values across bins and helps detect overall shape.
How to interpret: a roughly symmetric, bell-shaped histogram is consistent with normality; pronounced skew, gaps, or multiple peaks suggest otherwise.
A density plot gives a smoothed estimate of the distribution. This reduces noise and makes it easier to see skewness or irregularities.
How to interpret: look for a single, roughly symmetric peak; long tails or extra bumps point to skewness or mixtures.
QQ plots compare your sample data’s quantiles to those of a perfect normal distribution. This is the most reliable visual check for normality in small samples.
How to interpret: points lying close to the reference line indicate approximate normality; systematic curvature suggests skew, and S-shaped patterns suggest heavy or light tails.
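A minimal sketch of the QQ-plot check for both groups, assuming the algo_data data frame from earlier:

qs_vec <- algo_data$Time_ms[algo_data$Algorithm == "QuickSort"]
ms_vec <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
par(mfrow = c(1, 2))
qqnorm(qs_vec, main = "QQ Plot: QuickSort"); qqline(qs_vec, col = "red")
qqnorm(ms_vec, main = "QQ Plot: MergeSort"); qqline(ms_vec, col = "red")
par(mfrow = c(1, 1))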
Visual Tool | MergeSort | QuickSort |
---|---|---|
Histogram | Slightly skewed right? | Nearly symmetric |
Density | Smooth, possible tail | Fairly bell-shaped |
QQ Plot | Close to the line | Close, minor deviations |
Neither distribution is perfectly normal — and that’s expected. But both are “normal enough”, especially with n = 30, so we can safely proceed with t-tests and confidence intervals thanks to the CLT.
Use this guide when preparing to apply statistical tests that assume normality, such as t-tests, ANOVA, or constructing confidence intervals.
If normality cannot be assumed, consider non-parametric tests such as the Wilcoxon signed-rank test, the Mann–Whitney U test, or the Kruskal–Wallis test, all covered later in this lecture.
While visual tools like histograms and QQ plots give an intuitive sense of distribution shape, statistical analysis also benefits from a formal test of normality. One of the most commonly used tests for this purpose is the Shapiro–Wilk test.
It is particularly useful for small to medium-sized datasets, and it’s widely supported in statistical software like R.
The Shapiro–Wilk test evaluates whether your data could reasonably come from a normal distribution. It does so by comparing the ordered sample values with the values expected under normality and summarizing the agreement in a statistic W between 0 and 1, where values close to 1 indicate good agreement.
If your data are very different from normal (e.g., heavily skewed, multimodal), the test will yield a small W value and a low p-value.
Like most tests, the Shapiro–Wilk test is built on null and alternative hypotheses: H₀ states that the data come from a normal distribution, and H₁ states that they do not.
This means the p-value tells us how likely it is to see the given data if the population were normal.
- If p > 0.05, we do not reject H₀ → the data could be normal (normality is plausible).
- If p < 0.05, we reject H₀ → the data are significantly non-normal.
⚠️ Important: A high p-value does not prove that the data are normal — it only means we don’t have strong evidence against normality.
Let’s run the test for MergeSort and QuickSort execution times.
qs_data <- algo_data$Time_ms[algo_data$Algorithm == "QuickSort"]
ms_data <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
shapiro.test(qs_data)
##
## Shapiro-Wilk normality test
##
## data: qs_data
## W = 0.96209, p-value = 0.35
shapiro.test(ms_data)
##
## Shapiro-Wilk normality test
##
## data: ms_data
## W = 0.93428, p-value = 0.06386
Interpretation: both p-values are above 0.05 (QuickSort: p = 0.35; MergeSort: p ≈ 0.064), so we do not reject normality for either algorithm, and parametric methods such as the t-test remain reasonable choices.
A confidence interval (CI) is a range of values, calculated from sample data, that is likely to contain the true population parameter — such as a mean, proportion, or difference in means.
It reflects both the sample estimate and the uncertainty due to sampling variation.
For example, if we say:
“MergeSort was faster by 4.1 milliseconds, 95% CI [1.6, 8.4]”
It means: we are 95% confident that the true difference in performance lies somewhere between 1.6 and 8.4 milliseconds in favor of MergeSort.
This is often misunderstood, so let’s be precise.
A 95% confidence interval does not mean that there is a 95% chance the true value is in this particular interval.
Instead:
If we repeated the same experiment 100 times, and calculated a confidence interval from each sample, we expect 95 of those intervals to contain the true population parameter.
The “95%” refers to the long-run success rate of the method, not the probability of the truth being in one interval.
A common confusion in thesis writing is using mean ± standard error (SE) and assuming it’s a confidence interval. It is not.
Here’s the difference:
Standard Error (SE) tells you how much the sample mean is expected to vary from sample to sample. \[ SE = \frac{s}{\sqrt{n}} \]
A Confidence Interval uses the SE to define a margin of error: \[ CI = \bar{x} \pm t \times SE \] where \(t\) is the critical value from the t-distribution (e.g., ≈ 2 for 95%).
Think of SE as the raw uncertainty, and the CI as a range of plausible values, based on how uncertain we are.
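As a sketch of how the pieces fit together, here is the 95% CI for the MergeSort mean computed by hand from the SE and the t critical value; by construction it matches what t.test() reports below.

ms_vec <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
n <- length(ms_vec)
se <- sd(ms_vec) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)        # two-sided 95% critical value from the t-distribution
mean(ms_vec) + c(-1, 1) * t_crit * se  # lower and upper bounds of the 95% CI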
The easiest way to compute a CI is with t.test(). Even if you don't need the p-value, this function gives you the sample mean, the 95% confidence interval, and the test statistics in one call.
qs_data <- algo_data$Time_ms[algo_data$Algorithm == "QuickSort"]
ms_data <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
qs_ci <- t.test(qs_data)$conf.int
ms_ci <- t.test(ms_data)$conf.int
cat("QuickSort CI:", round(qs_ci, 2), "\n")
## QuickSort CI: 49.6 55.22
cat("MergeSort CI:", round(ms_ci, 2), "\n")
## MergeSort CI: 45.43 49.35
Notice that the two intervals do not overlap: the QuickSort interval starts at about 49.6 ms, just above the upper end of the MergeSort interval (about 49.4 ms). This already hints at a real difference, which we examine formally below.
Visualizing uncertainty helps you and your audience understand how confident your estimates are.
The plot below shows the mean execution time for each algorithm, along with its 95% confidence interval:
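The plotting code is not included in the source; a minimal base-R sketch using the intervals computed above might look like this:

means <- c(MergeSort = mean(ms_data), QuickSort = mean(qs_data))
lower <- c(ms_ci[1], qs_ci[1])
upper <- c(ms_ci[2], qs_ci[2])
mid <- barplot(means, ylim = c(0, max(upper) * 1.1),
               ylab = "Execution Time (ms)", main = "Mean Execution Time with 95% CI")
arrows(mid, lower, mid, upper, angle = 90, code = 3, length = 0.1)  # CI whiskers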
To interpret this plot: each bar marks a group's mean, the vertical whiskers span its 95% confidence interval, and intervals that do not overlap are a strong hint (though not a formal test) that the means differ.
Sometimes we’re interested in the difference between two groups. For example:
“Is MergeSort significantly faster than QuickSort?”
We can construct a CI for the difference of means:
t.test(Time_ms ~ Algorithm, data = algo_data)$conf.int
## [1] -8.384496 -1.657631
## attr(,"conf.level")
## [1] 0.95
The output is:
[1] -8.38 -1.66
This means the entire interval for the difference (MergeSort minus QuickSort) lies below zero: we are 95% confident that MergeSort's mean execution time is between about 1.7 and 8.4 ms lower than QuickSort's.
However, if the CI had been something like:
[-1.23, 4.56]
Then the interval would contain zero, meaning the data would be consistent with no difference at all, and we could not claim a statistically significant effect.
A hypothesis is a declarative statement that proposes a possible explanation or expected outcome based on prior knowledge or observation. It must be testable, falsifiable, and stated precisely enough to be evaluated with data.
Statistical analysis is not only about describing data; it is about making decisions and drawing conclusions from it. When researchers ask questions like "Is algorithm A faster than algorithm B?" or "Does this change improve accuracy?", they need a formal procedure for answering them.
Hypothesis testing is a formal procedure for evaluating a research claim using data. The logic is based on contradiction: we assume a null hypothesis is true, then test whether the observed data are consistent or inconsistent with that assumption. If the data are very unlikely under the assumption, we reject it.
In essence, hypothesis testing asks:
“If the null hypothesis were true, how likely is it that we’d see results like these?”
Every hypothesis test involves two opposing statements:
- Null Hypothesis (H₀): the default claim. It assumes no effect, no difference, or no association. Example: "There is no difference in mean execution time between MergeSort and QuickSort."
- Alternative Hypothesis (H₁ or Ha): the research claim we hope to support. It suggests there is a difference or effect. Example: "MergeSort has a faster mean execution time than QuickSort."
The test evaluates whether the data provide enough evidence to reject H₀ in favor of H₁.
We never “prove” H₁ directly — we either reject or fail to reject the null.
The p-value is the probability of obtaining results as extreme or more extreme than what we observed, assuming H₀ is true.
- A small p-value (typically less than 0.05) means the observed result is unlikely under H₀, so we reject H₀ and call the result statistically significant.
- A large p-value means the observed result is plausible under H₀, so we fail to reject H₀.
A p-value is not the probability that H₀ is true.
It is the probability of seeing this data if H₀ were true.
A p-value of 0.03 means:
“If the algorithms had equal performance, we’d see this result (or something more extreme) only 3% of the time.”
Statistical decisions are based on probabilities, not certainties — so errors can occur.
Type | Description | Symbol | Example |
---|---|---|---|
Type I Error | Rejecting H₀ when it is actually true | \(\alpha\) | Claiming MergeSort is faster when it isn’t |
Type II Error | Failing to reject H₀ when it is false | \(\beta\) | Failing to detect a real speed difference |
We control Type I errors by setting a significance level \(\alpha\). We reduce Type II errors by increasing sample size, reducing variability, or using better tests.
When formulating hypotheses, researchers must also decide what kind of difference they’re testing for.
- A two-tailed test checks for any difference, either higher or lower. Example: H₀: µ₁ = µ₂ versus H₁: µ₁ ≠ µ₂.
- A one-tailed test checks for a difference in a specific direction. Example: H₀: µ₁ ≥ µ₂ versus H₁: µ₁ < µ₂.
Misusing one-tailed tests after seeing the results leads to bias and invalid conclusions.
The significance level \(\alpha\) is the threshold for decision-making.
If the p-value is below \(\alpha\) (commonly 0.05), we reject H₀; otherwise we fail to reject it.
You may choose stricter levels (e.g., 0.01) when the consequences of a false positive are high (e.g., in medicine or critical systems).
A strong hypothesis:
- Is grounded in prior knowledge or observation: it should be informed by theory, literature, or empirical results.
- Clearly identifies the variables: what are you comparing or measuring?
- Is framed in a way that allows for statistical testing: for example, using terms like "greater than," "different from," or "associated with."
- Uses operational definitions: specifies how each variable will be measured or categorized.
Research Question | Hypothesis (H₀ / H₁) |
---|---|
Is MergeSort faster than QuickSort on large inputs? | H₀: µ₁ = µ₂ (mean times are equal); H₁: µ₁ < µ₂ (MergeSort mean time is less than QuickSort's) |
Does UI design affect user satisfaction? | H₀: no difference in satisfaction scores across designs; H₁: at least one design has a different mean satisfaction |
Are male and female students equally likely to pass? | H₀: pass rate is independent of gender; H₁: pass rate depends on gender |
A well-formed hypothesis does more than guide analysis. It focuses your experimental design, determines which statistical test is appropriate, and makes your conclusions easier to defend.
Later in this course, when we discuss hypothesis testing, p-values, and significance, you’ll see how everything starts from how well the hypothesis was written.
Concept | Role in Hypothesis Testing |
---|---|
H₀ / H₁ | Competing claims about population parameters |
p-value | Probability of result under H₀ |
α level | Cutoff for rejecting H₀ |
Significance | Whether the result is unlikely under H₀ |
Type I / II errors | Risks in decision-making |
Tail direction | Determines how evidence is evaluated (1- vs 2-sided) |
Now that we’ve covered how to formulate hypotheses, assess assumptions, and interpret confidence and significance, we are ready to explore the core statistical tests used in real-world research — especially in computer science and engineering contexts.
These tests allow researchers to formally answer questions such as whether one algorithm is faster than another, whether a design change affects outcomes, and whether two variables are related.
The t-test is one of the most widely used statistical tests in scientific research. It allows us to compare mean values and determine whether the observed difference between them is statistically significant.
There are three main types of t-tests: the one-sample t-test, the two-sample (independent) t-test, and the paired t-test.
Before using any type of t-test, the following assumptions should be checked: the observations are independent, the data (or the paired differences) are approximately normal, and, for the classical two-sample test, the group variances are similar (Welch's version relaxes this last assumption).
Use when: You want to test whether a single sample mean differs from a known value (benchmark or theoretical mean).
“Does MergeSort have a mean execution time different from 50 milliseconds?”
ms_data <- algo_data$Time_ms[algo_data$Algorithm == "MergeSort"]
t.test(ms_data, mu = 50)
##
## One Sample t-test
##
## data: ms_data
## t = -2.7223, df = 29, p-value = 0.01085
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 45.42997 49.35095
## sample estimates:
## mean of x
## 47.39046
The test result includes:
- Mean of the sample
- t-statistic
- p-value
- Confidence interval for the mean
If the p-value < 0.05, we reject the null and conclude that the MergeSort mean time differs significantly from 50 ms.
Use when: You want to compare the means of two unrelated groups.
“Is there a significant difference in mean execution time between MergeSort and QuickSort?”
t.test(Time_ms ~ Algorithm, data = algo_data, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Time_ms by Algorithm
## t = -2.9959, df = 51.806, p-value = 0.004192
## alternative hypothesis: true difference in means between group MergeSort and group QuickSort is not equal to 0
## 95 percent confidence interval:
## -8.384496 -1.657631
## sample estimates:
## mean in group MergeSort mean in group QuickSort
## 47.39046 52.41152
Note that t.test() uses var.equal = FALSE by default (Welch's t-test), which does not assume equal variances.
This is the most relevant test for your central thesis question:
> "Is MergeSort significantly faster than QuickSort?"
Use when: You measure two conditions on the same subjects — for example, run both algorithms on the same inputs.
Let’s simulate that situation for illustration.
We pretend we used the same 30 datasets and ran both algorithms on each.
set.seed(123)
input_ids <- 1:30
qs_times <- rnorm(30, mean = 52, sd = 6)
ms_times <- rnorm(30, mean = 48, sd = 5)
paired_data <- data.frame(
Input = input_ids,
QuickSort = qs_times,
MergeSort = ms_times
)
t.test(paired_data$QuickSort, paired_data$MergeSort, paired = TRUE)
##
## Paired t-test
##
## data: paired_data$QuickSort and paired_data$MergeSort
## t = 2.001, df = 29, p-value = 0.05483
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.06243577 5.71380733
## sample estimates:
## mean difference
## 2.825686
While t-tests are ideal for comparing two means, many research questions in computer science involve three or more groups. In such cases, performing multiple t-tests increases the risk of Type I error. The solution is to use ANOVA — Analysis of Variance.
ANOVA tests whether there is a statistically significant difference in means across multiple groups, while maintaining control over the overall error rate.
Instead of asking “Which mean is different?”, ANOVA first asks:
“Is there any significant difference among the group means?”
Use one-way ANOVA when you have one categorical factor (e.g., algorithm) with three or more levels and a numeric outcome (e.g., run-time).
Let’s simulate a third sorting algorithm: HeapSort, and compare it alongside MergeSort and QuickSort using a one-way ANOVA.
set.seed(2025)
qs <- rnorm(30, mean = 52, sd = 6)
ms <- rnorm(30, mean = 48, sd = 5)
hs <- rnorm(30, mean = 50, sd = 4.5)
algo3_data <- data.frame(
Algorithm = rep(c("QuickSort", "MergeSort", "HeapSort"), each = 30),
Time_ms = c(qs, ms, hs)
)
anova_model <- aov(Time_ms ~ Algorithm, data = algo3_data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Algorithm 2 408.2 204.12 7.499 0.00099 ***
## Residuals 87 2368.2 27.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the p-value < 0.05, we reject the null and conclude that there is a significant difference in performance between at least two algorithms.
ANOVA tells us there is a difference — but not which groups differ. For that, we use Tukey’s Honest Significant Difference (HSD) test.
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Time_ms ~ Algorithm, data = algo3_data)
##
## $Algorithm
## diff lwr upr p adj
## MergeSort-HeapSort -0.9764591 -4.1886528 2.235735 0.7494717
## QuickSort-HeapSort 3.9499434 0.7377497 7.162137 0.0118579
## QuickSort-MergeSort 4.9264025 1.7142088 8.138596 0.0012577
This test compares all pairs of algorithms and adjusts for multiple testing. It shows:
- Mean differences between groups
- Confidence intervals
- p-values adjusted for multiple comparisons
A boxplot can help visualize group differences and identify potential outliers or skewness.
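For example, a quick sketch for the three simulated algorithms:

boxplot(Time_ms ~ Algorithm, data = algo3_data,
        col = c("lightyellow", "lightgreen", "lightblue"),
        main = "Run-Time by Algorithm", ylab = "Execution Time (ms)")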
To ensure valid results, ANOVA relies on a few key assumptions:
Assumption | How to Check |
---|---|
Independence | Data should be collected independently across groups |
Normality | Each group’s data should be roughly normal → check with QQ plot or Shapiro-Wilk |
Homogeneity of variance | Group variances should be similar → check with Bartlett or Levene test |
bartlett.test(Time_ms ~ Algorithm, data = algo3_data)
##
## Bartlett test of homogeneity of variances
##
## data: Time_ms by Algorithm
## Bartlett's K-squared = 1.3655, df = 2, p-value = 0.5052
If p > 0.05, the variances are not significantly different → good!
Use two-way ANOVA when you have two categorical factors (for example, Algorithm and Input Size) and want to test each factor's main effect as well as their interaction.
Suppose we test each algorithm on two types of datasets: Small and Large inputs.
set.seed(123)
algo <- rep(c("QuickSort", "MergeSort", "HeapSort"), times = 30)
size <- rep(rep(c("Small", "Large"), each = 15), times = 3)
runtime <- rnorm(90, mean = ifelse(algo == "MergeSort", 48,
ifelse(algo == "QuickSort", 52, 50)) +
ifelse(size == "Large", 3, 0),
sd = 4.5)
two_factor_data <- data.frame(
Algorithm = algo,
InputSize = size,
Time_ms = runtime
)
two_way_model <- aov(Time_ms ~ Algorithm * InputSize, data = two_factor_data)
summary(two_way_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Algorithm 2 477.3 238.67 14.485 3.94e-06 ***
## InputSize 1 142.4 142.44 8.645 0.00424 **
## Algorithm:InputSize 2 4.9 2.45 0.149 0.86182
## Residuals 84 1384.1 16.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model will test:
- The main effect of Algorithm
- The main effect of Input Size
- The interaction between Algorithm × Input Size
A significant interaction means the performance difference between algorithms depends on input size.
Test | When to Use | What It Tells You |
---|---|---|
One-way ANOVA | Compare ≥3 groups | Is there any difference among group means? |
Tukey HSD | After ANOVA | Which groups differ from each other |
Two-way ANOVA | Two factors (e.g., Algorithm × Size) | Main effects and interaction |
Assumption checks | Before running ANOVA | Validates test reliability |
ANOVA provides a robust framework to test complex experimental designs — ideal for evaluating multiple systems, configurations, or algorithms.
While t-tests and ANOVA are designed for numerical data, many research questions in computer science involve categorical outcomes, such as success versus failure of a run, crash versus no crash, or user preference categories.
To analyze such data, we use Chi-Squared (χ²) Tests, which allow us to evaluate whether observed counts differ from expected counts under a certain hypothesis.
Use when: You want to test whether one categorical variable fits an expected distribution.
Suppose we test HeapSort on 60 datasets and record the number of Successful vs Failed runs. Based on theory, we expect a 95% success rate. Is the observed distribution consistent with that?
# Observed outcomes: Success and Fail
observed <- c(Success = 54, Fail = 6)
# Expected probabilities: 95% success, 5% fail
expected_probs <- c(0.95, 0.05)
expected_counts <- sum(observed) * expected_probs
chisq.test(x = observed, p = expected_probs)
## Warning in chisq.test(x = observed, p = expected_probs): Chi-squared
## approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 3.1579, df = 1, p-value = 0.07556
Use when: You want to test whether two categorical variables are independent.
You record the outcome (Success or Failure) of each sorting algorithm across 30 trials. Does algorithm choice affect outcome?
set.seed(42)
algorithms <- rep(c("QuickSort", "MergeSort", "HeapSort"), each = 30)
# Simulate outcome: mostly successful, but vary by algorithm
outcomes <- c(
sample(c("Success", "Fail"), 30, replace = TRUE, prob = c(0.93, 0.07)),
sample(c("Success", "Fail"), 30, replace = TRUE, prob = c(0.98, 0.02)),
sample(c("Success", "Fail"), 30, replace = TRUE, prob = c(0.90, 0.10))
)
cat_data <- data.frame(
Algorithm = algorithms,
Outcome = outcomes
)
# Create contingency table
table_data <- table(cat_data$Algorithm, cat_data$Outcome)
chisq.test(table_data)
## Warning in chisq.test(table_data): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table_data
## X-squared = 9.6041, df = 2, p-value = 0.008213
Assumption | Explanation |
---|---|
Expected counts ≥ 5 | Each cell in the contingency table should have at least 5 expected values |
Independent observations | Each outcome should be from a distinct trial or unit |
Sufficient sample size | Chi-squared tests are not reliable for very small datasets |
You can check expected counts directly:
chisq.test(table_data)$expected
## Warning in chisq.test(table_data): Chi-squared approximation may be incorrect
##
## Fail Success
## HeapSort 2.333333 27.66667
## MergeSort 2.333333 27.66667
## QuickSort 2.333333 27.66667
If any expected count < 5, consider:
- Combining categories
- Using Fisher's Exact Test (for 2×2 tables and small samples)
Chi-squared tests are ideal when analyzing:
- Error types by method
- Crash rates by platform
- Pass/fail rates by algorithm
- Survey preferences by demographic
These tests give categorical results statistical grounding, enabling you to support claims about relationships and proportions with confidence.
Correlation analysis is used to determine whether two numeric variables move together — and if so, how strongly and in what direction.
In computer science research, this is useful when investigating relationships such as input size versus run-time, memory usage versus throughput, or training-set size versus model accuracy.
However, it's crucial to remember:
> Correlation measures association, not cause and effect.
Value of r | Interpretation |
---|---|
+1 | Perfect positive linear correlation |
0 | No linear correlation |
-1 | Perfect negative linear correlation |
Let’s simulate a relationship between input size and algorithm run-time:
set.seed(123)
input_size <- seq(100, 1000, length.out = 50)
qs_runtime <- 0.05 * input_size + rnorm(50, mean = 0, sd = 10)
cor(input_size, qs_runtime) # Pearson correlation
## [1] 0.8185339
This computes the Pearson correlation coefficient between input size and QuickSort run time.
cor(input_size, qs_runtime, method = "spearman")
## [1] 0.8128211
Always pair correlation with a scatterplot. This helps reveal non-linear patterns, outliers, and clusters that a single correlation coefficient can hide.
plot(input_size, qs_runtime,
main = "Input Size vs QuickSort Time",
xlab = "Input Size", ylab = "Run Time (ms)",
col = "steelblue", pch = 16)
abline(lm(qs_runtime ~ input_size), col = "red", lwd = 2)
You can also view relationships between multiple variables using a correlation matrix:
df <- data.frame(
Size = input_size,
QuickSort = qs_runtime,
Noise = rnorm(50),
Quadratic = input_size^2 + rnorm(50, sd = 100)
)
round(cor(df), 2)
## Size QuickSort Noise Quadratic
## Size 1.00 0.82 0.13 0.98
## QuickSort 0.82 1.00 0.09 0.81
## Noise 0.13 0.09 1.00 0.16
## Quadratic 0.98 0.81 0.16 1.00
This produces a matrix of correlation values for multiple numeric variables.
A high correlation between A and B does not imply that A causes B.
Correlation may arise because:
- A causes B
- B causes A
- A and B are both affected by a third variable (confounding)
- It's a coincidence (especially in small datasets)
Always interpret correlations in context, and support with theory or controlled experimentation.
Method | Use Case | Notes |
---|---|---|
Pearson | Numeric, linear, normal | Most common for numeric variables |
Spearman | Ordinal or monotonic | Robust to outliers and non-linearity |
Scatterplot | Visual confirmation | Shows shape and outliers |
Correlation matrix | Many variables | Explore pairwise relationships |
Correlation is a powerful exploratory tool — but not a substitute for causal inference or experimental validation.
Not all datasets meet the assumptions required for parametric tests like t-tests and ANOVA. In particular, data may be skewed, contain strong outliers, be ordinal rather than truly numeric, or come from very small samples.
In such situations, we turn to non-parametric statistical tests.
These tests do not assume normality and are based on ranks, not actual values.
They are especially useful for:
- Small sample sizes
- Median comparisons
- Ordinal (ranked) data
- Robust testing in the presence of outliers
You test QuickSort and MergeSort on the same 30 datasets, but suspect the differences aren’t normally distributed.
set.seed(1)
qs_times <- rnorm(30, mean = 52, sd = 7)
ms_times <- rnorm(30, mean = 48, sd = 7)
# Apply Wilcoxon signed-rank test
wilcox.test(qs_times, ms_times, paired = TRUE)
##
## Wilcoxon signed rank exact test
##
## data: qs_times and ms_times
## V = 340, p-value = 0.02623
## alternative hypothesis: true location shift is not equal to 0
This test is robust to outliers, unlike the paired t-test.
Compare MergeSort and QuickSort performance across two independent datasets, but the data appear skewed.
# Group labels
group <- rep(c("QuickSort", "MergeSort"), each = 30)
times <- c(qs_times, ms_times)
# Apply Mann–Whitney U test
wilcox.test(times ~ group)
##
## Wilcoxon rank sum exact test
##
## data: times by group
## W = 276, p-value = 0.009604
## alternative hypothesis: true location shift is not equal to 0
This test does not compare means; it compares the rank ordering of the two groups (often interpreted as a comparison of medians). It is suitable when the scale is ordinal or the data are skewed or heavy-tailed.
Compare QuickSort, MergeSort, and HeapSort on run-time data where distributions are not normal.
group <- rep(c("QuickSort", "MergeSort", "HeapSort"), each = 30)
times <- c(
rnorm(30, mean = 52, sd = 6),
rnorm(30, mean = 48, sd = 6),
rnorm(30, mean = 50, sd = 6)
)
kruskal.test(times ~ group)
##
## Kruskal-Wallis rank sum test
##
## data: times by group
## Kruskal-Wallis chi-squared = 11.135, df = 2, p-value = 0.00382
If significant, you may follow up with pairwise Wilcoxon tests and adjust for multiple comparisons.
Test | Parametric Equivalent | Use Case | Notes |
---|---|---|---|
Wilcoxon signed-rank | Paired t-test | Two related samples | Tests medians |
Mann–Whitney U | Two-sample t-test | Two independent samples | Tests rank tendency |
Kruskal–Wallis | One-way ANOVA | 3+ independent groups | Tests rank distributions |
Use non-parametric tests when normality is doubtful, samples are small, the data are ordinal, or outliers would dominate a mean-based analysis.
This decision matrix helps you identify the most appropriate test based on your research goal, the data type, the number of groups or variables, and whether parametric assumptions hold.
Research Goal | Data Type | Groups | Parametric? | Test to Use |
---|---|---|---|---|
Compare sample mean to known value | Numeric | 1 group | ✔️ Yes | One-sample t-test |
Compare two means (independent) | Numeric | 2 groups | ✔️ Yes | Two-sample t-test |
Compare two means (paired) | Numeric | 2 paired sets | ✔️ Yes | Paired t-test |
Compare 3+ means (independent) | Numeric | 3+ groups | ✔️ Yes | One-way ANOVA |
Compare 3+ means with 2 factors | Numeric | 3+ groups, 2 factors | ✔️ Yes | Two-way ANOVA |
Compare two medians (independent) | Numeric or ordinal | 2 groups | ❌ No | Mann–Whitney U test |
Compare two medians (paired) | Numeric or ordinal | 2 paired sets | ❌ No | Wilcoxon signed-rank test |
Compare 3+ medians (independent) | Numeric or ordinal | 3+ groups | ❌ No | Kruskal–Wallis test |
Compare observed vs expected counts | Categorical | 1 variable | N/A | Chi-squared goodness-of-fit |
Test association between two categorical vars | Categorical | 2 variables | N/A | Chi-squared test of independence |
Test correlation between 2 numeric vars | Numeric | 2 variables | ✔️ Linear | Pearson correlation |
Test rank-based correlation | Ordinal or non-linear | 2 variables | ❌ No | Spearman correlation |
1. Identify your research question → What are you trying to compare or test?
2. Classify your data type → Is it numeric (continuous), ordinal, or categorical?
3. Count the groups or variables → One sample? Two groups? Multiple factors?
4. Check your assumptions → Is your data normal? Are variances equal?
5. Select the test → Use the matrix above to choose the appropriate test.