Data Analysis Masterclass

From Hypothesis Testing to Predictive Modeling

Author

Abdullah Al Shamim

Published

March 15, 2026

Introduction

Statistical analysis allows us to look past raw numbers to find meaningful patterns. This document covers the core workflow of a data analyst: testing differences between groups, checking associations between categories, and predicting future trends.


1. Hypothesis Testing: The T-Test

A Two-Sample T-Test determines whether the means of two groups differ by more than chance alone would explain. Here we test whether the gap in life expectancy between Africa and Europe is statistically significant.

Visualization: Density Distributions

Code
library(gapminder)
library(tidyverse)
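Code
# Data Preparation
df <- gapminder %>% 
  select(continent, lifeExp) %>% 
  filter(continent %in% c("Africa", "Europe"))

# Calculate group means for plotting.
# dplyr is used instead of plyr::ddply(): loading plyr after tidyverse
# masks dplyr verbs such as summarise() and mutate().
mu <- df %>% 
  group_by(continent) %>% 
  summarise(grp.mean = mean(lifeExp))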

# Density Plot
df %>% 
  ggplot(aes(lifeExp, fill = continent)) +
  geom_density(alpha = 0.4) +
  theme_classic(base_size = 11) +
  geom_vline(data = mu, aes(xintercept = grp.mean), linetype = "dashed") +
  annotate("text", x = 62, y = 0.06, label = "Mean Life Exp \n in Europe = 71.9", size = 3) +
  annotate("text", x = 38, y = 0.06, label = "Mean Life Exp \n in Africa = 48.9", size = 3) +
  scale_fill_manual(values = c("Europe" = "#03E4D8", "Africa" = "#005F7A")) +
  labs(title = "Life Expectancy in Africa and Europe", x = "Life Expectancy", y = "Density")

Performing the T-Test

Interpreting the Result
  • p-value: If the p-value is below 0.05, we reject the null hypothesis that the two continents have the same mean life expectancy.
  • Interpretation: The difference is significant. In this dataset, mean life expectancy in Europe (71.9 years) is about 23 years higher than in Africa (48.9 years).
Code
gapminder %>% 
  filter(continent %in% c("Africa", "Europe")) %>% 
  t.test(lifeExp ~ continent, data = ., alternative = "two.sided")

    Welch Two Sample t-test

data:  lifeExp by continent
t = -49.551, df = 981.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
95 percent confidence interval:
 -23.95076 -22.12595
sample estimates:
mean in group Africa mean in group Europe 
            48.86533             71.90369 
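
To see where the reported t statistic comes from, the sketch below recomputes the Welch statistic by hand and compares it with the result of t.test(). It assumes the gapminder package is installed; the names two, africa, europe, se, and t_manual are illustrative and not part of the original analysis.

```r
library(gapminder)

# Keep only the two continents being compared
two <- subset(gapminder, continent %in% c("Africa", "Europe"))
africa <- two$lifeExp[two$continent == "Africa"]
europe <- two$lifeExp[two$continent == "Europe"]

# Welch t statistic: difference in means divided by its unpooled standard error
se <- sqrt(var(africa) / length(africa) + var(europe) / length(europe))
t_manual <- (mean(africa) - mean(europe)) / se

# The same test via t.test(); R uses the Welch version by default
res <- t.test(lifeExp ~ continent, data = two)
t_builtin <- unname(res$statistic)

c(manual = t_manual, builtin = t_builtin)
res$conf.int  # 95% confidence interval for the difference in means
res$p.value   # numeric p-value behind the "< 2.2e-16" printout
```

Because the object returned by t.test() is a list, components such as res$estimate and res$conf.int can be reused in reports instead of copying numbers from the console.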

2. Analysis of Variance (ANOVA)

ANOVA is used when comparing three or more groups. It checks if at least one group’s mean is different from the others.

Goal: Compare life expectancy across three continents.

Visualization: Grouped Comparison

Code
gapminder %>%
  filter(continent %in% c("Americas", "Europe", "Asia")) %>% 
  ggplot(aes(continent, lifeExp, colour = continent)) +
  geom_boxplot(outliers = FALSE) +
  stat_summary(fun = "mean", size = 1, alpha = 0.5) +
  coord_flip() +
  theme_classic(base_size = 11) +
  theme(legend.position = "none") +
  scale_color_manual(values = c("Europe" = "#03E4D8", "Americas" = "#005F7A", "Asia" = "#00A0A9")) +
  labs(title = "Life Expectancy in Americas, Asia, and Europe", x = "Continent", y = "Life Expectancy")

ANOVA & Post-Hoc Test (Tukey)

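The ANOVA tells us whether a difference exists; the Tukey HSD test tells us which pairs of continents differ. Note that the test below uses 2007 data for Africa, Asia, and Europe, whereas the plot above shows the Americas, Asia, and Europe across all years.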

Code
# ANOVA Test
anova_res <- gapminder %>% 
  filter(year == 2007) %>% 
  filter(continent %in% c("Africa", "Europe", "Asia")) %>% 
  aov(lifeExp ~ continent, data = .)

summary(anova_res)
             Df Sum Sq Mean Sq F value Pr(>F)    
continent     2  11273    5637   89.96 <2e-16 ***
Residuals   112   7017      63                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Tukey HSD (Pairwise Comparison)
TukeyHSD(anova_res)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent, data = .)

$continent
                   diff       lwr      upr     p adj
Asia-Africa   15.922446 11.737968 20.10692 0.0000000
Europe-Africa 22.842562 18.531988 27.15314 0.0000000
Europe-Asia    6.920115  2.177224 11.66301 0.0021526
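
When there are many pairs, it can help to filter the Tukey table programmatically. The following is a minimal sketch, assuming the gapminder package is installed; the names d, fit, tukey, and significant are illustrative.

```r
library(gapminder)

# Same subset as the ANOVA above: 2007 data for three continents
d <- subset(gapminder, year == 2007 & continent %in% c("Africa", "Europe", "Asia"))
d$continent <- droplevels(d$continent)
fit <- aov(lifeExp ~ continent, data = d)

# TukeyHSD() returns a list with one matrix per factor; convert it to a data frame
tukey <- as.data.frame(TukeyHSD(fit)$continent)
tukey$pair <- rownames(tukey)

# Keep only the pairs whose adjusted p-value is below 0.05
significant <- tukey[tukey[["p adj"]] < 0.05, c("pair", "diff", "p adj")]
significant
```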

Layman’s Interpretation

Understanding the “Referee”

ANOVA acts as a referee when you have more than two groups. It compares the variation between continents to the variation within each continent.

  • F-Statistic: A high F-value (here 89.96) suggests that mean life expectancy differs between at least two of the continents.
  • Tukey HSD: ANOVA only says “someone is different.” The Tukey test identifies the specific pairs that differ (e.g., Europe vs. Asia). If a pair’s adjusted p-value is < 0.05, the two means are significantly different.
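
The between/within comparison described above can be sketched by computing the F statistic by hand. This is an illustration under the assumption that the gapminder package is installed; the names ss_between, ss_within, and f_manual are hypothetical.

```r
library(gapminder)

# Same subset as the ANOVA above
d <- subset(gapminder, year == 2007 & continent %in% c("Africa", "Europe", "Asia"))
d$continent <- droplevels(d$continent)

grand_mean <- mean(d$lifeExp)
group_mean <- ave(d$lifeExp, d$continent)  # each row gets its own group's mean

# Variation between groups: how far group means sit from the grand mean
ss_between <- sum((group_mean - grand_mean)^2)
# Variation within groups: how far observations sit from their group mean
ss_within <- sum((d$lifeExp - group_mean)^2)

k <- nlevels(d$continent)
df_between <- k - 1
df_within <- nrow(d) - k

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)
f_aov <- summary(aov(lifeExp ~ continent, data = d))[[1]][["F value"]][1]

c(manual = f_manual, aov = f_aov)
```

A large ratio means the group means are spread far apart relative to the scatter inside each group, which is what the small p-value in the ANOVA table reflects.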

3. Chi-Squared Test: Categorical Analysis

Chi-squared tests are for categories (e.g., Species, Size).

  • Goodness of Fit: Tests if the distribution of categories matches our expectations. By default, chisq.test() expects all categories to be equally likely.
  • Independence: Tests if one category (Species) is related to another (Size).
Code
# Data Preparation: Discretizing Iris data
flower <- iris %>% 
  mutate(size = cut(Sepal.Length, breaks = 3, labels = c("Small", "Medium","Large"))) %>%
  select(Species, size)
Code
# Goodness of Fit Test
flower %>% select(size) %>% table() %>% chisq.test()

    Chi-squared test for given probabilities

data:  .
X-squared = 28.44, df = 2, p-value = 6.673e-07
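
The goodness-of-fit statistic can also be rebuilt from observed and expected counts. The sketch below uses only base R and the built-in iris data; it assumes the default expectation of equal counts per size class, and the names observed, expected, and chi_sq are illustrative.

```r
# Recreate the size classes used above
size <- cut(iris$Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large"))
observed <- table(size)

# Under the default null hypothesis every class is equally likely
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: squared gaps between observed and expected, scaled by expected
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

rbind(observed = as.vector(observed), expected = expected)
c(chi_sq = chi_sq, p_value = p_value)
```

To test against unequal expectations, chisq.test() accepts a vector of probabilities through its p argument.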
Code
# Test of Independence
flower %>% table() %>% chisq.test()

    Pearson's Chi-squared test

data:  .
X-squared = 111.63, df = 4, p-value < 2.2e-16

The Layman’s Rule

If p < 0.05 in the Test of Independence, Species and Size are associated: knowing the Species of a flower helps you predict its Size.
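
One way to see what this association looks like is to compare the observed cross-table with the counts expected under independence. This is a minimal base R sketch using the built-in iris data; the names tab and res are illustrative.

```r
# Recreate the size classes and cross-tabulate them against Species
size <- cut(iris$Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large"))
tab <- table(Species = iris$Species, size = size)

res <- chisq.test(tab)

tab                            # observed counts
round(res$expected, 1)         # counts expected if Species and size were unrelated
round(prop.table(tab, 1), 2)   # share of each size within a species (rows sum to 1)
```

If the rows of the last table look very different from one another, Species carries information about Size, which is what the small p-value indicates.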


4. Simple Linear Regression

Linear regression is one of the foundations of machine learning. We use one variable (the Predictor) to estimate another (the Outcome).

Modeling Speed vs. Distance (cars dataset)

Code
# Building the model
cars %>% 
  lm(dist ~ speed, data = .) %>%  
  summary()

Call:
lm(formula = dist ~ speed, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
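
The coefficient table defines a straight line, dist = intercept + slope × speed. The sketch below applies that line by hand and checks it against predict(). It uses only base R; speed_new is a hypothetical value chosen inside the observed speed range.

```r
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)  # named vector: "(Intercept)" and "speed"

# Hypothetical new speed (mph), within the range covered by the data
speed_new <- 15

# Prediction by hand from the fitted line
manual <- b[["(Intercept)"]] + b[["speed"]] * speed_new

# Same prediction via predict()
auto <- unname(predict(fit, data.frame(speed = speed_new)))

c(manual = manual, predict = auto)  # both about 41.4 ft of stopping distance
```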

Modeling Weight vs. Height (women dataset)

Code
# Create model
model1 <- lm(weight ~ height, data = women)
summary(model1)

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991, Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Predictive Modeling: Forecasting the Future

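Once the model is built, we can provide it with new heights to predict what the weight should be. Note that the women dataset only covers heights from 58 to 72 inches, so of the three values below only 66 lies inside the data; the predictions for 44 and 99 extrapolate far beyond it and should not be trusted.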

Code
# New data for prediction
data_new <- data.frame(height = c(66, 44, 99))

# Generate and round outcomes
predictions <- predict(model1, data_new)
round(predictions)
  1   2   3 
140  64 254 
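
A point prediction says nothing about its own uncertainty. As a hedged sketch, the code below adds 95% prediction intervals and flags which new heights fall inside the range of heights the model was fitted on. The in_range column is a hypothetical helper, not part of the original analysis.

```r
model1 <- lm(weight ~ height, data = women)
data_new <- data.frame(height = c(66, 44, 99))

# interval = "prediction" returns fit, lwr, and upr for each new observation
pred <- predict(model1, data_new, interval = "prediction")

# Flag heights that lie inside the range covered by the training data
height_range <- range(women$height)
data_new$in_range <- data_new$height >= height_range[1] &
  data_new$height <= height_range[2]

cbind(data_new, round(pred, 1))
```

The intervals widen as the new height moves away from the observed data, and even those wider intervals assume the straight-line relationship continues to hold there.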

Layman’s Interpretation

Modeling the Relationship (Height vs. Weight)
  • The Slope: This is your Rate of Change. In this model, the slope is approximately 3.45. This means that for every 1-inch increase in a woman’s height, the model predicts her weight will increase by roughly 3.45 pounds.
  • R-Squared (\(R^2\)): This measures Goodness of Fit, the share of the variation in the outcome that the model explains. The \(R^2\) for this dataset is very high (approx. 0.99), meaning that about 99% of the variation in weight is explained by height alone. It is an almost perfect linear relationship, leaving only about 1% to “mystery” factors like bone density or muscle mass. Part of the reason is that the women dataset records average weights for each height rather than individual measurements.
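
The \(R^2\) value described above can be reproduced from its definition, one minus the ratio of residual variation to total variation. The sketch below uses only base R; the names ss_res, ss_tot, and r2_manual are illustrative.

```r
model1 <- lm(weight ~ height, data = women)

# Variation left over after fitting the line
ss_res <- sum(residuals(model1)^2)
# Total variation of weight around its mean
ss_tot <- sum((women$weight - mean(women$weight))^2)

r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared correlation
r2_cor <- cor(women$height, women$weight)^2

c(manual = r2_manual, correlation = r2_cor, model = summary(model1)$r.squared)
```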

Conclusion Checklist

  • Comparing 2 groups? T-Test.
  • Comparing 3+ groups? ANOVA + Tukey.
  • Finding categorical links? Chi-Squared.
  • Predicting a numeric outcome? Linear Regression.

Quick Analysis Summary Checklist

Goal                       Test to Use
Compare 2 means            T-Test
Compare 3+ means           ANOVA
Check for Category Link    Chi-Squared Independence
Predict a numeric value    Linear Regression