Inferential statistics is a branch of statistics used to draw conclusions and make inferences about a larger population based on collected sample data.It has 2 main types or methods it uses which are hypothesis testing and regression analysis.
Steps for Hypothesis testing
Types of Statistical Tests in Hypothesis Testing
Null: Sample mean is same as the population mean.
Alternate: Sample mean is not same as the population mean. Mathematically is calculated by:
\[ z = (x — μ) / (σ / √n) \]
Where:\(x\) is sample mean,\(\mu\) is population mean and $σ / √n $ is population standard deviation
NOTE: If the test statistic is lower than the critical value, accept the hypothesis.
T-test: Compares the means of two groups (used when standard deviation is unknown and/or sample is small).
Chi-square test: Compares categorical variables, like determining whether sample data matches population data (chi-square goodness of fit test) or if two categorical variables are related (chi-square test of independence).
ANOVA (Analysis of Variance): Compares the difference between three or more groups of a single independent variable (one-way ANOVA), or tests the effect of one or more independent variables on two or more dependent variables (MANOVA).
Note: For details and mathematical equations visit HERE
#Test of differences and compare means across groups.
# Example using gapminder
library(gapminder)
library(dplyr)
# t-test: Life expectancy between two continents
gapminder %>%
filter(continent %in% c("Asia", "Europe")) %>%
t.test(lifeExp ~ continent, data = .)
# ANOVA: Life expectancy across all continents
aov_model <- aov(lifeExp ~ continent, data = gapminder)
summary(aov_model)
#Objective: Test association (independence) between categorical variables.
#Tests whether continent group and year are associated.
gapminder %>%
mutate(continent_group = ifelse(continent %in% c("Asia", "Europe"), "Group1", "Group2")) %>%
count(continent_group, year) %>%
xtabs(~ continent_group + year) %>%
chisq.test()
This test checks if Asia has significantly higher life expectancy than Europe. Directional hypotheses are useful for policy-driven questions
#Objective: Formalize statistical decision-making.
# Hypothesis: Life expectancy in Asia > Europe
asia <- gapminder %>% filter(continent == "Asia") %>% pull(lifeExp)
europe <- gapminder %>% filter(continent == "Europe") %>% pull(lifeExp)
t.test(asia, europe, alternative = "greater")
Simulate the educational data with student_id,
year, score, gender
columns
Draw the hypothesis to be tested
Test the difference between gender and scores
set.seed(123)
edu_data <- data.frame(
student_id = rep(1:100, each = sample(1:3, 100, replace = TRUE)),
year = sample(2015:2020, 300, replace = TRUE),
score = rnorm(300, mean = 75, sd = 10),
gender = sample(c("Male", "Female"), 300, replace = TRUE)
)
# t-test: Gender difference in scores
t.test(score ~ gender, data = edu_data)
Objective: Predicting Life Expectancy
lm_model <- lm(lifeExp ~ gdpPercap + pop, data = gapminder)
summary(lm_model)
# Create scatterplot with regression line (based on gdpPercap only for visualization)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ x, color = "black") +
labs(
title = "Life Expectancy vs GDP per Capita",
x = "GDP per Capita",
y = "Life Expectancy"
) +
theme_minimal()
Note: This linear regression helps us to quantify how
GDP and population influence life expectancy. Coefficients
indicate direction and magnitude.
Objective: Modeling Binary Outcomes
# Load required libraries
library(ggplot2)
library(dplyr)
library(gapminder)
# Create binary outcome variable
gapminder <- gapminder %>%
mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))
# Fit logistic regression model
binary_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial)
summary(binary_model)
# Add predicted probabilities to the dataset
gapminder <- gapminder %>%
mutate(predicted_prob = predict(binary_model, type = "response"))
# Plot predicted probabilities vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = predicted_prob)) +
geom_point(aes(color = continent), alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE, color = "black") +
labs(
title = "Predicted Probability of High Life Expectancy vs GDP per Capita",
x = "GDP per Capita",
y = "Predicted Probability",
color = "Continent"
) +
theme_minimal()
Note: Binary regression is ideal for yes/no outcomes. And here we predict the likelihood of high life expectancy.
Objective: Logistic Model with Logit Link
# Create binary outcome variable
gapminder <- gapminder %>%
mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))
# Fit logistic regression model with logit link
logit_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial(link = "logit"))
# Add predicted log-odds to the dataset
gapminder <- gapminder %>%
mutate(log_odds = predict(logit_model, type = "link"))
# Plot log-odds vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = log_odds)) +
geom_point(aes(color = continent), alpha = 0.6) +
geom_smooth(method = "loess", se = FALSE, color = "black") +
labs(
title = "Log-Odds of High Life Expectancy vs GDP per Capita",
x = "GDP per Capita",
y = "Log-Odds (Logit)",
color = "Continent"
) +
theme_minimal()
Note: Logistic regression uses the logit link to model probabilities. Coefficients reflect odds ratios.
Objective: Generalized Linear Models
GLMs extend linear models to non-normal outcomes. Here, we gonna use the identity link for a standard regression
Key Concepts:Flexible distributions and Link functions
# Fit Gaussian GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))
# Create a dataframe with fitted values and residuals
residual_data <- gapminder %>%
mutate(
fitted = fitted(glm_model),
residuals = residuals(glm_model)
)
# Plot residuals vs fitted values
ggplot(residual_data, aes(x = fitted, y = residuals)) +
geom_point(aes(color = continent), alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
labs(
title = "Residuals vs Fitted Values for GLM Model",
x = "Fitted Values",
y = "Residuals",
color = "Continent"
) +
theme_minimal()
Notes: - This plot helps assess the linearity
and homoscedasticity assumptions of your model.
Linear regression comes with its own set of assumptions
If these assumptions are violated, our model might mislead us, leading to inaccurate predictions and faulty conclusions.
To achieve the optimal accurate results, the researcher should
testand diagnose those regression issues
sufficiently
Assumption: The relationship between predictors and the outcome should be linear.
Testing
plot(glm_model, which = 1) # Residuals vs Fitted
Assumption: Residuals should be independent.
Testing
library(car)
durbinWatsonTest(glm_model) # from 'car' package
Assumptions:: Residuals should have constant variance.
Testing
plot(glm_model, which = 3) # Scale-Location plot
Assumption: Residuals should be normally distributed.
Testing
plot(glm_model, which = 2) # Normal Q-Q plot
shapiro.test(residuals(glm_model))
Assumption: Predictors are not highly correlated.
Testing
car::vif(glm_model)
#Installing packages
install.packages(c("ggplot2", "patchwork", "performance", "car"))
#Load libraries
library(ggplot2)
library(patchwork)
library(performance)
library(car)
# Fit the GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))
# Create diagnostic plot together
# Residuals vs Fitted
p1 <- ggplot(glm_model, aes(.fitted, .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Residuals vs Fitted", x = "Fitted", y = "Residuals")
# Normal Q-Q Plot
p2 <- ggplot(glm_model, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
labs(title = "Normal Q-Q Plot")
# Scale-Location Plot
p3 <- ggplot(glm_model, aes(.fitted, sqrt(abs(.resid)))) +
geom_point(alpha = 0.6) +
geom_smooth(se = FALSE) +
labs(title = "Scale-Location", x = "Fitted", y = "√|Residuals|")
# Cook's Distance
p4 <- ggplot(glm_model, aes(seq_along(.cooksd), .cooksd)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Cook's Distance", x = "Observation", y = "Cook's D")
# Combine plots into dashboard
dashboard <- p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
dashboard
#Additional check for multicollinearity
vif(glm_model)
Load the penguin data
Explore it and check the relationship between adelie flipper
length and bill depth using GLM and test each
assumptions.
Hint: use library(palmerpenguins) for data
accessibility.
Panel data is the data that tracks multiple entities (e.g., individuals, firms, countries) over time.
It combines cross-sectional and
time-series dimensions.
Panel data allows us to track changes over time and control for unobserved individual effects.
Example: GDP and inflation for multiple countries over several years.
1. Balanced panel data
Definition: Every entity (e.g., individual, firm, country) is observed at all time periods.
Structure: No missing time points for any unit.
Example: GDP data for 10 countries from 2000 to 2020, with no gaps.
Advantages:
2. Unbalanced Panel Data
Definition: Some entities are not observed in all time periods.
Structure: Missing time points for some units.
Example: Some countries have GDP data only from 2005 to 2020.
Challenges:
The panel data typically has:
Entity ID (e.g., country, firm, person)
Time variable (e.g., year, month)
Variables of interest (e.g., GDP, income, health)
Use the
plmpackage
library(plm)
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))
edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))
Note: - Balanced panels have equal time points per unit. Unbalanced panels are more realistic but require careful handling
Check data (Time and individual) dimensions with
pdim(gapminder_p) and pdim(edu_p)
Understanding panel structure helps in choosing appropriate models and diagnostics.
Panel data analysis combines the strengths of
time seriesandcross-sectional data
Modeling panel data involves choosing the right estimator to account for the structure of repeated observations over time for each unit (e.g., country, student).
There are three different estimators and can be implemented in
R using plm inbuit library.
#balanced Panel data
library(plm)
data("gapminder", package = "gapminder")
# Create panel data frame
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))
#unbalenced: simulated educational data
set.seed(123)
edu_data <- data.frame(
student_id = rep(1:100, each = sample(1:3, 100, replace = TRUE)),
year = sample(2015:2020, 300, replace = TRUE),
score = rnorm(300, mean = 75, sd = 10),
gender = sample(c("Male", "Female"), 300, replace = TRUE)
)
edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))
#Step2: Estimate models
#1. Pooled OLS
pooled_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "pooling")
summary(pooled_model)
#2.Fixed Effect (with estimator)
fe_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "within")
summary(fe_model)
#3. random Effect
re_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "random")
summary(re_model)
#Step 3: Model comparison
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)
Note: - Comparison of FE and RE coefficients
Fixed effects control for time-invariant characteristics.
Random effects assume no correlation with predictors.
Interpretation:
If p < 0.05, prefer Fixed Effects (RE assumptions violated).
If p ≥ 0.05, Random Effects may be more efficient.