Statistical Inference

Inferential statistics is the branch of statistics used to draw conclusions and make inferences about a larger population based on collected sample data. Two of its main methods are hypothesis testing and regression analysis.

📊 Section I: Hypothesis Testing

  • Hypothesis testing is a statistical method used to evaluate the strength of evidence for or against an assumption (hypothesis).
  • It helps determine if there is enough evidence from sample data to draw conclusions about a larger population.

Steps for Hypothesis testing

  • State a null hypothesis (Ho) and alternative hypothesis (Ha or H1).
  • Choose the significance level (\(\alpha\)).
  • Collect data for the statistical test.
  • Perform an appropriate statistical test and calculate a test statistic.
  • Compare the test statistic against the critical value.
  • Reject or fail to reject the null hypothesis based on findings.

Types of Statistical Tests in Hypothesis Testing

  • Z-test: Compares the means of two groups (used when the population standard deviation is known and the sample is large). It assumes the sample is normally distributed. We use this test to check the hypothesis that the sample belongs to the same population.

Null: The sample mean is the same as the population mean.

Alternative: The sample mean is not the same as the population mean. Mathematically, the test statistic is calculated as:

\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]

Where \(\bar{x}\) is the sample mean, \(\mu\) is the population mean, \(\sigma\) is the population standard deviation, and \(n\) is the sample size (so \(\sigma / \sqrt{n}\) is the standard error of the mean).

NOTE: If the absolute value of the test statistic is lower than the critical value, we fail to reject the null hypothesis; otherwise we reject it.
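A minimal sketch of this calculation in R (the sample values, population mean of 100, and population standard deviation of 15 are hypothetical, chosen only for illustration):

# Hypothetical one-sample z-test, computed by hand
set.seed(1)
x <- rnorm(36, mean = 103, sd = 15)       # simulated sample of size n = 36
z <- (mean(x) - 100) / (15 / sqrt(36))    # z = (xbar - mu) / (sigma / sqrt(n))
z

# Two-sided critical value at alpha = 0.05
crit <- qnorm(0.975)
abs(z) > crit   # TRUE: reject H0; FALSE: fail to reject H0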

  • T-test: Compares the means of two groups (used when standard deviation is unknown and/or sample is small).

  • Chi-square test: Compares categorical variables, like determining whether sample data matches population data (chi-square goodness of fit test) or if two categorical variables are related (chi-square test of independence).

  • ANOVA (Analysis of Variance): Compares the difference between three or more groups of a single independent variable (one-way ANOVA), or tests the effect of one or more independent variables on two or more dependent variables (MANOVA).

Note: For details and mathematical equations visit HERE

Implementing in R

T-Test/ANOVA

# Objective: Test differences and compare means across groups.
# Example using gapminder
library(gapminder)
library(dplyr)

# t-test: Life expectancy between two continents
gapminder %>%
  filter(continent %in% c("Asia", "Europe")) %>%
  t.test(lifeExp ~ continent, data = .)

# ANOVA: Life expectancy across all continents
aov_model <- aov(lifeExp ~ continent, data = gapminder)
summary(aov_model)

Chi-square Test

#Objective: Test association (independence) between categorical variables.
#Tests whether continent group and year are associated.
gapminder %>%
  mutate(continent_group = ifelse(continent %in% c("Asia", "Europe"), "Group1", "Group2")) %>%
  xtabs(~ continent_group + year, data = .) %>%
  chisq.test()

This chi-square test checks whether the continent grouping and year are associated (i.e., whether the two categorical variables are independent of each other).

Hypothesis Testing

#Objective: Formalize statistical decision-making.
# Hypothesis: Life expectancy in Asia > Europe
asia <- gapminder %>% filter(continent == "Asia") %>% pull(lifeExp)
europe <- gapminder %>% filter(continent == "Europe") %>% pull(lifeExp)

t.test(asia, europe, alternative = "greater")
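This one-sided test checks whether Asia has significantly higher life expectancy than Europe. Directional hypotheses like this are useful for policy-driven questions.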

Exercise

  • Simulate educational data with student_id, year, score, and gender columns

  • State the hypothesis to be tested

  • Test the difference in scores between genders

set.seed(123)
# 1-3 records per student gives an unbalanced structure
n_per_student <- sample(1:3, 100, replace = TRUE)
edu_data <- data.frame(
  student_id = rep(1:100, times = n_per_student),
  year = unlist(lapply(n_per_student, function(k) sample(2015:2020, k))),
  score = rnorm(sum(n_per_student), mean = 75, sd = 10),
  gender = rep(sample(c("Male", "Female"), 100, replace = TRUE), times = n_per_student)
)

# t-test: Gender difference in scores
t.test(score ~ gender, data = edu_data)

📈 Section II: Regression Analysis

Linear regression

Objective: Predicting Life Expectancy

  • Key Concepts: Continuous outcome and Multiple predictors
lm_model <- lm(lifeExp ~ gdpPercap + pop, data = gapminder)
summary(lm_model)


# Load ggplot2 for plotting
library(ggplot2)

# Create scatterplot with regression line (based on gdpPercap only for visualization)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  labs(
    title = "Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Life Expectancy"
  ) +
  theme_minimal()

Note: This linear regression helps us to quantify how GDP and population influence life expectancy. Coefficients indicate direction and magnitude.
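To inspect that direction and magnitude directly, a minimal follow-up on the fitted model above:

# Point estimates and 95% confidence intervals for the regression coefficients
coef(lm_model)
confint(lm_model)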

Binary Regression

Objective: Modeling Binary Outcomes

  • Key Concepts: Dichotomous dependent variable and Logistic regression
# Load required libraries
library(ggplot2)
library(dplyr)
library(gapminder)

# Create binary outcome variable
gapminder <- gapminder %>%
  mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))

# Fit logistic regression model
binary_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial)

summary(binary_model)

# Add predicted probabilities to the dataset
gapminder <- gapminder %>%
  mutate(predicted_prob = predict(binary_model, type = "response"))

# Plot predicted probabilities vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = predicted_prob)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Predicted Probability of High Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Predicted Probability",
    color = "Continent"
  ) +
  theme_minimal()

Note: Binary (logistic) regression is ideal for yes/no outcomes; here we predict the likelihood of high life expectancy.

Logistic Regression

Objective: Logistic Model with Logit Link

  • Key Concepts: Logit transformation and Odds interpretation
# Create binary outcome variable
gapminder <- gapminder %>%
  mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))

# Fit logistic regression model with logit link
logit_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial(link = "logit"))

# Add predicted log-odds to the dataset
gapminder <- gapminder %>%
  mutate(log_odds = predict(logit_model, type = "link"))

# Plot log-odds vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = log_odds)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Log-Odds of High Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Log-Odds (Logit)",
    color = "Continent"
  ) +
  theme_minimal()

Note: Logistic regression uses the logit link to model probabilities. Coefficients reflect odds ratios.
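Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios. A minimal follow-up on the fitted model above:

# Odds ratios: exponentiated logit-scale coefficients
exp(coef(logit_model))

# Corresponding 95% confidence intervals on the odds-ratio scale
exp(confint(logit_model))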

GLM

Objective: Generalized Linear Models

  • GLMs extend linear models to non-normal outcomes. Here, we use the identity link for a standard (Gaussian) regression.

  • Key Concepts: Flexible distributions and Link functions

# Fit Gaussian GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))

# Create a dataframe with fitted values and residuals
residual_data <- gapminder %>%
  mutate(
    fitted = fitted(glm_model),
    residuals = residuals(glm_model)
  )

# Plot residuals vs fitted values
ggplot(residual_data, aes(x = fitted, y = residuals)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Residuals vs Fitted Values for GLM Model",
    x = "Fitted Values",
    y = "Residuals",
    color = "Continent"
  ) +
  theme_minimal()

Notes:

  • This plot helps assess the linearity and homoscedasticity assumptions of your model.

  • Patterns or trends in residuals may suggest model misspecification.

Problems with Regression

  • Linear regression comes with its own set of assumptions.

  • If these assumptions are violated, the model can mislead us, leading to inaccurate predictions and faulty conclusions.

  • To obtain accurate results, the researcher should test for and diagnose these regression issues.

Linearity

  • Assumption: The relationship between predictors and the outcome should be linear.

  • Testing

plot(glm_model, which = 1)  # Residuals vs Fitted
  • Treatment: Transformations (e.g., log(x)).
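As a sketch of such a treatment (the log-transformed model below is illustrative, not part of the original specification), refitting with log(gdpPercap) often straightens the GDP-life expectancy relationship:

# Refit with a log-transformed predictor and recheck the residuals
glm_log <- glm(lifeExp ~ log(gdpPercap) + pop, data = gapminder, family = gaussian(link = "identity"))
plot(glm_log, which = 1)  # Residuals vs Fitted for the transformed model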

Independence of Errors

  • Assumption: Residuals should be independent.

  • Testing

library(car)
durbinWatsonTest(glm_model)  # from 'car' package
  • Treatment: Use time-series models (e.g., ARIMA) or mixed models if data is clustered.

Homoscedasticity (Constant Variance)

  • Assumption: Residuals should have constant variance.

  • Testing

plot(glm_model, which = 3)  # Scale-Location plot
  • Treatment: Transform the response variable.

Normality of Residuals

  • Assumption: Residuals should be normally distributed.

  • Testing

plot(glm_model, which = 2)  # Normal Q-Q plot
shapiro.test(residuals(glm_model))
  • Treatment: Transform the response or use robust regression.

Multicollinearity

  • Assumption: Predictors are not highly correlated.

  • Testing

car::vif(glm_model)
  • Treatment: Remove or combine correlated variables, or use PCA/ridge regression.

Bonus/Diagnostic Dashboard

#Installing packages
install.packages(c("ggplot2", "patchwork", "performance", "car"))
#Load libraries
library(ggplot2)
library(patchwork)
library(performance)
library(car)
# Fit the GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))
# Create diagnostic plot together
# Residuals vs Fitted
p1 <- ggplot(glm_model, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted", y = "Residuals")

# Normal Q-Q Plot
p2 <- ggplot(glm_model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot")

# Scale-Location Plot
p3 <- ggplot(glm_model, aes(.fitted, sqrt(abs(.resid)))) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE) +
  labs(title = "Scale-Location", x = "Fitted", y = "√|Residuals|")

# Cook's Distance
p4 <- ggplot(glm_model, aes(seq_along(.cooksd), .cooksd)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Cook's Distance", x = "Observation", y = "Cook's D")

# Combine plots into dashboard
dashboard <- p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
dashboard
#Additional check for multicollinearity
vif(glm_model)
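The performance package loaded above can also produce a combined diagnostic panel in a single call, as an alternative to building the dashboard manually (it may additionally require the see package for plotting):

# One-call diagnostic panel from the 'performance' package
check_model(glm_model)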

Labs

  • Load the penguin data

  • Explore it and examine the relationship between Adelie flipper length and bill depth using a GLM, testing each assumption.

  • Hint: use library(palmerpenguins) for data accessibility.
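A minimal starting point for the lab, assuming the palmerpenguins package is installed (the model formula below is one possible choice for the exercise, not a prescribed solution):

library(palmerpenguins)

# Keep only Adelie penguins
adelie <- subset(penguins, species == "Adelie")

# Gaussian GLM of flipper length on bill depth; follow up with the assumption checks above
adelie_glm <- glm(flipper_length_mm ~ bill_depth_mm, data = adelie, family = gaussian())
summary(adelie_glm)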

Section III: Panel Data Analysis

Meaning

  • Panel data is the data that tracks multiple entities (e.g., individuals, firms, countries) over time.

  • It combines cross-sectional and time-series dimensions.

  • Panel data allows us to track changes over time and control for unobserved individual effects.

  • Example: GDP and inflation for multiple countries over several years.

Types

1. Balanced panel data

  • Definition: Every entity (e.g., individual, firm, country) is observed at all time periods.

  • Structure: No missing time points for any unit.

  • Example: GDP data for 10 countries from 2000 to 2020, with no gaps.

  • Advantages:

    • Easier to analyze.
    • More efficient estimators.

2. Unbalanced Panel Data

  • Definition: Some entities are not observed in all time periods.

  • Structure: Missing time points for some units.

  • Example: Some countries have GDP data only from 2005 to 2020.

  • Challenges:

    • More complex modeling.
    • Potential bias if missingness is not random.

Structure

Panel data typically has:

  • Entity ID (e.g., country, firm, person)

  • Time variable (e.g., year, month)

  • Variables of interest (e.g., GDP, income, health)

Converting a dataset into panel data

Use the plm package

library(plm)
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))
edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))

Note: Balanced panels have equal time points per unit; unbalanced panels are more realistic but require careful handling.

  • Check the data dimensions (time and individual) with pdim(gapminder_p) and pdim(edu_p), as shown in the sketch after this list.

  • Understanding panel structure helps in choosing appropriate models and diagnostics.
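For example:

pdim(gapminder_p)  # balanced: 142 countries, each observed for 12 years
pdim(edu_p)        # unbalanced: the number of observations varies by student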

Modelling

Panel data analysis combines the strengths of time-series and cross-sectional data.

  • Modeling panel data involves choosing the right estimator to account for the structure of repeated observations over time for each unit (e.g., country, student).

  • There are three main estimators, all of which can be implemented in R using the plm package:

    • Pooled OLS,
    • Fixed Effects (FE), and
    • Random Effects (RE)

Implementation

# Balanced panel data
library(plm)
data("gapminder", package = "gapminder")

# Create panel data frame
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))

# Unbalanced: simulated educational data
set.seed(123)
# 1-3 records per student gives an unbalanced structure
n_per_student <- sample(1:3, 100, replace = TRUE)
edu_data <- data.frame(
  student_id = rep(1:100, times = n_per_student),
  year = unlist(lapply(n_per_student, function(k) sample(2015:2020, k))),
  score = rnorm(sum(n_per_student), mean = 75, sd = 10),
  gender = rep(sample(c("Male", "Female"), 100, replace = TRUE), times = n_per_student)
)

edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))

# Step 2: Estimate models

#1. Pooled OLS
pooled_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "pooling")
summary(pooled_model)

# 2. Fixed Effects (within estimator)
fe_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "within")
summary(fe_model)

# 3. Random Effects

re_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "random")
summary(re_model)

#Step 3: Model comparison
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)

Note: Comparison of FE and RE coefficients (see the sketch after the interpretation below)

  • Fixed effects control for time-invariant characteristics.

  • Random effects assume no correlation with predictors.

Interpretation:

  • If p < 0.05, prefer Fixed Effects (RE assumptions violated).

  • If p ≥ 0.05, Random Effects may be more efficient.
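To see the coefficient comparison mentioned in the note above, a minimal sketch that lines up the FE and RE slope estimates (the RE intercept is dropped so the names match):

# Side-by-side FE and RE slope estimates
cbind(
  FE = coef(fe_model),
  RE = coef(re_model)[names(coef(fe_model))]
)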

Graphing Panel data

  • Basic Viz (see the sketch below)
  • Advanced analysis and visualization of gapminder
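A minimal sketch of a basic visualization, assuming ggplot2 and gapminder are loaded (one line per country, faceted by continent):

# Spaghetti plot: life expectancy trajectories per country
ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.3) +
  facet_wrap(~ continent) +
  labs(title = "Life Expectancy over Time by Country",
       x = "Year", y = "Life Expectancy") +
  theme_minimal()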