From Hypothesis Testing to Predictive Modeling
Statistical analysis allows us to look past raw numbers to find meaningful patterns. This document covers the core workflow of a data analyst: testing differences between groups, checking associations between categories, and predicting future trends.
A two-sample t-test determines whether the means of two groups differ significantly. Here we test whether the gap in life expectancy between Africa and Europe is larger than chance alone would explain.
library(gapminder)
library(tidyverse)
library(plyr)

# Data Preparation
df <- gapminder %>%
  select(continent, lifeExp) %>%
  filter(continent %in% c("Africa", "Europe"))

# Calculate group means for plotting
mu <- ddply(df, "continent", summarise, grp.mean = mean(lifeExp))
# Density Plot
df %>%
  ggplot(aes(lifeExp, fill = continent)) +
  geom_density(alpha = 0.4) +
  theme_classic(base_size = 11) +
  geom_vline(data = mu, aes(xintercept = grp.mean), linetype = "dashed") +
  annotate("text", x = 62, y = 0.06, label = "Mean Life Exp \n in Europe = 71.9", size = 3) +
  annotate("text", x = 38, y = 0.06, label = "Mean Life Exp \n in Africa = 48.9", size = 3) +
  scale_fill_manual(values = c("Europe" = "#03E4D8", "Africa" = "#005F7A")) +
  labs(title = "Life Expectancy in Africa and Europe", x = "Life Expectancy", y = "Density")

# Two-sided Welch t-test: Africa vs. Europe
gapminder %>%
  filter(continent %in% c("Africa", "Europe")) %>%
  t.test(lifeExp ~ continent, data = ., alternative = "two.sided")
Welch Two Sample t-test
data: lifeExp by continent
t = -49.551, df = 981.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
95 percent confidence interval:
-23.95076 -22.12595
sample estimates:
mean in group Africa mean in group Europe
48.86533 71.90369
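To see where the numbers in the Welch output come from, the statistic and its degrees of freedom can be recomputed by hand. This is a minimal sketch in base R; the variable names (`africa`, `europe`, `t_manual`, `df_manual`) are illustrative and not part of the original analysis.

```r
library(gapminder)

# Life expectancy values for each continent (all years, as in the test above)
africa <- gapminder$lifeExp[gapminder$continent == "Africa"]
europe <- gapminder$lifeExp[gapminder$continent == "Europe"]

# Per-group variance of the mean; Welch's test does not pool variances
va <- var(africa) / length(africa)
ve <- var(europe) / length(europe)

# t statistic: difference in means divided by its standard error
t_manual <- (mean(africa) - mean(europe)) / sqrt(va + ve)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (va + ve)^2 /
  (va^2 / (length(africa) - 1) + ve^2 / (length(europe) - 1))

# Compare with t.test(), which uses the Welch variant by default
res <- t.test(africa, europe)
all.equal(unname(res$statistic), t_manual)
all.equal(unname(res$parameter), df_manual)
```

If the hand calculation is right, both comparisons return TRUE and the values should agree with the t and df reported in the output above.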
ANOVA is used when comparing three or more groups. It checks if at least one group’s mean is different from the others.
Goal: Compare life expectancy across three continents.
gapminder %>%
  filter(continent %in% c("Americas", "Europe", "Asia")) %>%
  ggplot(aes(continent, lifeExp, colour = continent)) +
  geom_boxplot(outliers = FALSE) +
  stat_summary(fun = "mean", size = 1, alpha = 0.5) +
  coord_flip() +
  theme_classic(base_size = 11) +
  theme(legend.position = "none") +
  scale_color_manual(values = c("Europe" = "#03E4D8", "Americas" = "#005F7A", "Asia" = "#00A0A9")) +
  labs(title = "Life Expectancy in Americas, Asia, and Europe", x = "Continent", y = "Life Expectancy")

The ANOVA tells us whether a difference exists; the Tukey HSD test tells us which pairs of continents differ. Note that the boxplot above shows the Americas, Asia, and Europe across all years, while the test below uses Africa, Asia, and Europe in 2007 only.
# ANOVA Test
anova_res <- gapminder %>%
  filter(year == 2007) %>%
  filter(continent %in% c("Africa", "Europe", "Asia")) %>%
  aov(lifeExp ~ continent, data = .)
summary(anova_res)

             Df Sum Sq Mean Sq F value Pr(>F)
continent     2  11273    5637   89.96 <2e-16 ***
Residuals   112   7017      63
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Tukey HSD (Pairwise Comparison)
TukeyHSD(anova_res)

Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent, data = .)
$continent
                   diff       lwr      upr     p adj
Asia-Africa   15.922446 11.737968 20.10692 0.0000000
Europe-Africa 22.842562 18.531988 27.15314 0.0000000
Europe-Asia    6.920115  2.177224 11.66301 0.0021526
ANOVA acts as a referee when you have more than two groups. It compares the variation between continents to the variation within each continent: the larger that ratio (the F value), the less plausible it is that all group means are equal.
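The between-versus-within comparison can be made concrete by computing the F value by hand. The sketch below assumes the same 2007 subset used for the ANOVA above; the names `d`, `ss_between`, `ss_within`, and `f_manual` are illustrative.

```r
library(gapminder)

# Same subset as the ANOVA: three continents, year 2007
d <- subset(gapminder, year == 2007 & continent %in% c("Africa", "Europe", "Asia"))
d$continent <- droplevels(d$continent)  # drop the unused continent levels

grand_mean <- mean(d$lifeExp)
group_mean <- tapply(d$lifeExp, d$continent, mean)
group_n    <- tapply(d$lifeExp, d$continent, length)

# Between-group sum of squares: how far group means sit from the grand mean
ss_between <- sum(group_n * (group_mean - grand_mean)^2)

# Within-group sum of squares: how far observations sit from their own group mean
ss_within <- sum((d$lifeExp - group_mean[as.character(d$continent)])^2)

k <- length(group_mean)  # number of groups
n <- nrow(d)             # number of observations

# F = mean square between / mean square within
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Compare with the value reported by aov()
f_aov <- summary(aov(lifeExp ~ continent, data = d))[[1]][["F value"]][1]
all.equal(f_manual, f_aov)
```

The degrees of freedom used here, k - 1 and n - k, correspond to the Df column in the ANOVA table above.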
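Chi-squared tests work on counts of categorical variables (e.g., Species, Size). The goodness-of-fit test checks one variable against expected proportions; the test of independence checks whether two variables are associated.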
# Data Preparation: Discretizing Iris data
flower <- iris %>%
  mutate(size = cut(Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large"))) %>%
  select(Species, size)

# Goodness of Fit Test
flower %>% select(size) %>% table() %>% chisq.test()
Chi-squared test for given probabilities
data: .
X-squared = 28.44, df = 2, p-value = 6.673e-07
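When no `p` argument is given, `chisq.test()` compares the observed counts against equal probabilities for every category. A minimal sketch of that calculation in base R follows; the names `observed`, `expected`, and `x2_manual` are illustrative.

```r
# Rebuild the size categories exactly as in the data preparation step
size <- cut(iris$Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large"))
observed <- table(size)

# Under the default null hypothesis every category is equally likely
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: squared deviations scaled by the expected counts
x2_manual <- sum((observed - expected)^2 / expected)

# p-value from the chi-squared distribution with (categories - 1) df
p_manual <- pchisq(x2_manual, df = length(observed) - 1, lower.tail = FALSE)

# Compare with chisq.test()
res <- chisq.test(observed)
all.equal(unname(res$statistic), x2_manual)
all.equal(res$p.value, p_manual)
```

A small p-value here only says the three size bins are not equally populated, which partly reflects how `cut()` splits the range into equal-width intervals rather than equal-count groups.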
# Test of Independence
flower %>% table() %>% chisq.test()
Pearson's Chi-squared test
data: .
X-squared = 111.63, df = 4, p-value < 2.2e-16
If p < 0.05 in the Test of Independence, we reject the hypothesis that Species and Size are independent: knowing the Species of a flower helps you predict its Size.
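The test of independence compares each observed cell count with the count expected if Species and Size were unrelated, that is, row total times column total divided by the grand total. The sketch below reproduces the statistic in base R; the names `observed`, `expected`, and `x2_manual` are illustrative.

```r
# Cross-tabulate Species against the three size categories
size <- cut(iris$Sepal.Length, breaks = 3, labels = c("Small", "Medium", "Large"))
observed <- table(iris$Species, size)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic summed over all cells
x2_manual <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df_manual <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_manual <- pchisq(x2_manual, df = df_manual, lower.tail = FALSE)

# Compare with chisq.test(); its $expected element holds the same expected counts
res <- chisq.test(observed)
all.equal(unname(res$statistic), x2_manual)
```

Inspecting `observed - expected` cell by cell shows which Species and Size combinations drive the result.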
Linear regression is a basic building block of predictive modeling. We use one variable (the predictor) to estimate another (the outcome).
# Building the model
cars %>%
  lm(dist ~ speed, data = .) %>%
  summary()
Call:
lm(formula = dist ~ speed, data = .)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
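For a single predictor, the coefficients and R-squared in this summary can be recovered from basic sample statistics. The following is a minimal sketch in base R; `slope`, `intercept`, and `r_squared` are illustrative names.

```r
fit <- lm(dist ~ speed, data = cars)

# Slope: covariance of predictor and outcome divided by predictor variance
slope <- cov(cars$speed, cars$dist) / var(cars$speed)

# Intercept: the fitted line passes through the point of means
intercept <- mean(cars$dist) - slope * mean(cars$speed)

# With one predictor, R-squared equals the squared correlation
r_squared <- cor(cars$speed, cars$dist)^2

# Compare with the fitted model
all.equal(unname(coef(fit)), c(intercept, slope))
all.equal(summary(fit)$r.squared, r_squared)
```

Read this way, the slope says how many additional feet of stopping distance the model associates with each extra mph of speed.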
# Create model
model1 <- lm(weight ~ height, data = women)
summary(model1)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
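Once the model is built, we can give it new heights and predict the corresponding weights. Heights in the women dataset range from 58 to 72 inches, so predictions for values far outside that range (such as 44 and 99 below) are extrapolations and should be treated with caution.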
# New data for prediction
data_new <- data.frame(height = c(66, 44, 99))
# Generate and round outcomes
predictions <- predict(model1, data_new)
round(predictions)

  1   2   3
140  64 254
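A point prediction is just the fitted line evaluated at the new height, and `predict()` can also return an interval around it. The sketch below refits the same model so it runs on its own; the `manual` vector and the `extrapolated` column are illustrative additions, not part of the original analysis.

```r
model1 <- lm(weight ~ height, data = women)
data_new <- data.frame(height = c(66, 44, 99))

# Manual prediction: intercept + slope * height
manual <- coef(model1)[["(Intercept)"]] + coef(model1)[["height"]] * data_new$height

# Same predictions with 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(model1, data_new, interval = "prediction", level = 0.95)

# Flag new heights that fall outside the range the model was fitted on
data_new$extrapolated <- data_new$height < min(women$height) |
  data_new$height > max(women$height)

cbind(data_new, round(pred, 1))
```

The intervals widen as the new height moves away from the observed data, which is one practical signal that a prediction rests on extrapolation.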
| Goal | Test to Use |
|---|---|
| Compare 2 means | T-Test |
| Compare 3+ means | ANOVA |
| Check for Category Link | Chi-Squared Independence |
| Predict a numeric value | Linear Regression |