Advanced Data Analysis in R

Statistical Inference

Inferential statistics is a branch of statistics used to draw conclusions and make inferences about a larger population based on collected sample data.It has 2 main types or methods it uses which are hypothesis testing and regression analysis.

📊 Section I: Hypothesis Testing

Hypothesis testing is a statistical method used to test the strength of an assumption from a hypothesis.
It helps determine if there is enough evidence from sample data to draw conclusions about a larger population.

Steps for Hypothesis testing

State a null hypothesis (Ho) and alternative hypothesis (Ha or H1).
Choose the significance level ($\alpha$).
Collect data for the statistical test.
Perform an appropriate statistical test and calculate a test statistic.
Compare the test statistic against the critical value.
Reject or fail to reject the null hypothesis based on findings.

Types of Statistical Tests in Hypothesis Testing

Z-test: T compares the means of two groups (used when standard deviation is known and sample is large). And it assumes that the sample is normally distributed. We use this test to validate a hypothesis that states the sample belongs to the same population.

Null: Sample mean is same as the population mean.

Alternate: Sample mean is not same as the population mean. Mathematically is calculated by:

\[ z = (x — μ) / (σ / √n) \]

Where:$x$ is sample mean,$\mu$ is population mean and $σ / √n $ is population standard deviation

NOTE: If the test statistic is lower than the critical value, accept the hypothesis.

T-test: Compares the means of two groups (used when standard deviation is unknown and/or sample is small).
Chi-square test: Compares categorical variables, like determining whether sample data matches population data (chi-square goodness of fit test) or if two categorical variables are related (chi-square test of independence).
ANOVA (Analysis of Variance): Compares the difference between three or more groups of a single independent variable (one-way ANOVA), or tests the effect of one or more independent variables on two or more dependent variables (MANOVA).

Note: For details and mathematical equations visit HERE

Implementing in R

T-Test/ANOVA

#Test of differences and compare means across groups.
# Example using gapminder
library(gapminder)
library(dplyr)

# t-test: Life expectancy between two continents
gapminder %>%
  filter(continent %in% c("Asia", "Europe")) %>%
  t.test(lifeExp ~ continent, data = .)

# ANOVA: Life expectancy across all continents
aov_model <- aov(lifeExp ~ continent, data = gapminder)
summary(aov_model)

Chi-square Test

#Objective: Test association (independence) between categorical variables.
#Tests whether continent group and year are associated.
gapminder %>%
  mutate(continent_group = ifelse(continent %in% c("Asia", "Europe"), "Group1", "Group2")) %>%
  count(continent_group, year) %>%
  xtabs(~ continent_group + year) %>%
  chisq.test()

This test checks if Asia has significantly higher life expectancy than Europe. Directional hypotheses are useful for policy-driven questions

Hypothesis Testing

#Objective: Formalize statistical decision-making.
# Hypothesis: Life expectancy in Asia > Europe
asia <- gapminder %>% filter(continent == "Asia") %>% pull(lifeExp)
europe <- gapminder %>% filter(continent == "Europe") %>% pull(lifeExp)

t.test(asia, europe, alternative = "greater")

Exercise

Simulate the educational data with student_id, year, score, gender columns
Draw the hypothesis to be tested
Test the difference between gender and scores

set.seed(123)
edu_data <- data.frame(
  student_id = rep(1:100, each = sample(1:3, 100, replace = TRUE)),
  year = sample(2015:2020, 300, replace = TRUE),
  score = rnorm(300, mean = 75, sd = 10),
  gender = sample(c("Male", "Female"), 300, replace = TRUE)
)

# t-test: Gender difference in scores
t.test(score ~ gender, data = edu_data)

📈 Section 2: Regression Analysis

Linear regression

Objective: Predicting Life Expectancy

Key Concepts: Continuous outcome and Multiple predictors

lm_model <- lm(lifeExp ~ gdpPercap + pop, data = gapminder)
summary(lm_model)


# Create scatterplot with regression line (based on gdpPercap only for visualization)
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  labs(
    title = "Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Life Expectancy"
  ) +
  theme_minimal()

Note: This linear regression helps us to quantify how GDP and population influence life expectancy. Coefficients indicate direction and magnitude.

Binary Regression

Objective: Modeling Binary Outcomes

Key Concepts: Dichotomous dependent variable and Logistic regression

# Load required libraries
library(ggplot2)
library(dplyr)
library(gapminder)

# Create binary outcome variable
gapminder <- gapminder %>%
  mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))

# Fit logistic regression model
binary_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial)

summary(binary_model)

# Add predicted probabilities to the dataset
gapminder <- gapminder %>%
  mutate(predicted_prob = predict(binary_model, type = "response"))

# Plot predicted probabilities vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = predicted_prob)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Predicted Probability of High Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Predicted Probability",
    color = "Continent"
  ) +
  theme_minimal()

Note: Binary regression is ideal for yes/no outcomes. And here we predict the likelihood of high life expectancy.

Logistic Regression

Objective: Logistic Model with Logit Link

Key Concepts:Logit transformation and Odds interpretation

# Create binary outcome variable
gapminder <- gapminder %>%
  mutate(high_lifeExp = ifelse(lifeExp > 70, 1, 0))

# Fit logistic regression model with logit link
logit_model <- glm(high_lifeExp ~ gdpPercap + pop, data = gapminder, family = binomial(link = "logit"))

# Add predicted log-odds to the dataset
gapminder <- gapminder %>%
  mutate(log_odds = predict(logit_model, type = "link"))

# Plot log-odds vs GDP per Capita
ggplot(gapminder, aes(x = gdpPercap, y = log_odds)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Log-Odds of High Life Expectancy vs GDP per Capita",
    x = "GDP per Capita",
    y = "Log-Odds (Logit)",
    color = "Continent"
  ) +
  theme_minimal()

Note: Logistic regression uses the logit link to model probabilities. Coefficients reflect odds ratios.

GLM

Objective: Generalized Linear Models

GLMs extend linear models to non-normal outcomes. Here, we gonna use the identity link for a standard regression
Key Concepts:Flexible distributions and Link functions

# Fit Gaussian GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))

# Create a dataframe with fitted values and residuals
residual_data <- gapminder %>%
  mutate(
    fitted = fitted(glm_model),
    residuals = residuals(glm_model)
  )

# Plot residuals vs fitted values
ggplot(residual_data, aes(x = fitted, y = residuals)) +
  geom_point(aes(color = continent), alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title = "Residuals vs Fitted Values for GLM Model",
    x = "Fitted Values",
    y = "Residuals",
    color = "Continent"
  ) +
  theme_minimal()

Notes: - This plot helps assess the linearity and homoscedasticity assumptions of your model.

Patterns or trends in residuals may suggest model misspecification.

Problems with Regression

Linear regression comes with its own set of assumptions
If these assumptions are violated, our model might mislead us, leading to inaccurate predictions and faulty conclusions.
To achieve the optimal accurate results, the researcher should testand diagnose those regression issues sufficiently

Linearity

Assumption: The relationship between predictors and the outcome should be linear.
Testing

plot(glm_model, which = 1)  # Residuals vs Fitted

Treatment: Transformations (e.g., log(x)).

Independence of Errors

Assumption: Residuals should be independent.
Testing

library(car)
durbinWatsonTest(glm_model)  # from 'car' package

Treatment: Use time-series models (e.g., ARIMA) or mixed models if data is clustered.

Homoscedasticity (Constant Variance)

Assumptions:: Residuals should have constant variance.
Testing

plot(glm_model, which = 3)  # Scale-Location plot

Treatment: Transform the response variable.

Normality of Residuals

Assumption: Residuals should be normally distributed.
Testing

plot(glm_model, which = 2)  # Normal Q-Q plot
shapiro.test(residuals(glm_model))

Treatment: Transform the response or use robust regression.

Multicollinearity

Assumption: Predictors are not highly correlated.
Testing

car::vif(glm_model)

Treatment: Remove or combine correlated variables, or use PCA/ridge regression.

Bonus/Diagnostic Dashboard

#Installing packages
install.packages(c("ggplot2", "patchwork", "performance", "car"))
#Load libraries
library(ggplot2)
library(patchwork)
library(performance)
library(car)
# Fit the GLM model
glm_model <- glm(lifeExp ~ gdpPercap + pop, data = gapminder, family = gaussian(link = "identity"))
# Create diagnostic plot together
# Residuals vs Fitted
p1 <- ggplot(glm_model, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted", y = "Residuals")

# Normal Q-Q Plot
p2 <- ggplot(glm_model, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q Plot")

# Scale-Location Plot
p3 <- ggplot(glm_model, aes(.fitted, sqrt(abs(.resid)))) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE) +
  labs(title = "Scale-Location", x = "Fitted", y = "√|Residuals|")

# Cook's Distance
p4 <- ggplot(glm_model, aes(seq_along(.cooksd), .cooksd)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Cook's Distance", x = "Observation", y = "Cook's D")

# Combine plots into dashboard
dashboard <- p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
dashboard
#Additional check for multicollinearity
vif(glm_model)

Labs

Load the penguin data
Explore it and check the relationship between adelie flipper length and bill depth using GLM and test each assumptions.
Hint: use library(palmerpenguins) for data accessibility.

Section III: Panel Data Analysis

Meaning

Panel data is the data that tracks multiple entities (e.g., individuals, firms, countries) over time.
It combines cross-sectional and time-series dimensions.
Panel data allows us to track changes over time and control for unobserved individual effects.
Example: GDP and inflation for multiple countries over several years.

Types

1. Balanced panel data

Definition: Every entity (e.g., individual, firm, country) is observed at all time periods.
Structure: No missing time points for any unit.
Example: GDP data for 10 countries from 2000 to 2020, with no gaps.
Advantages:
- Easier to analyze.
- More efficient estimators.

2. Unbalanced Panel Data

Definition: Some entities are not observed in all time periods.
Structure: Missing time points for some units.
Example: Some countries have GDP data only from 2005 to 2020.
Challenges:
- More complex modeling.
- Potential bias if missingness is not random.

Structure

The panel data typically has:

Entity ID (e.g., country, firm, person)
Time variable (e.g., year, month)
Variables of interest (e.g., GDP, income, health)

Dataset Conversion into panel data

Use the plm package

library(plm)
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))
edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))

Note: - Balanced panels have equal time points per unit. Unbalanced panels are more realistic but require careful handling

Check data (Time and individual) dimensions with pdim(gapminder_p) and pdim(edu_p)
Understanding panel structure helps in choosing appropriate models and diagnostics.

Modelling

Panel data analysis combines the strengths of time series and cross-sectional data

Modeling panel data involves choosing the right estimator to account for the structure of repeated observations over time for each unit (e.g., country, student).
There are three different estimators and can be implemented in R using plm inbuit library.
- Pooled OLS,
- Fixed Effects (FE), and
- Random Effects (RE)

Implementation

#balanced Panel data
library(plm)
data("gapminder", package = "gapminder")

# Create panel data frame
gapminder_p <- pdata.frame(gapminder, index = c("country", "year"))

#unbalenced: simulated educational data
set.seed(123)
edu_data <- data.frame(
  student_id = rep(1:100, each = sample(1:3, 100, replace = TRUE)),
  year = sample(2015:2020, 300, replace = TRUE),
  score = rnorm(300, mean = 75, sd = 10),
  gender = sample(c("Male", "Female"), 300, replace = TRUE)
)

edu_p <- pdata.frame(edu_data, index = c("student_id", "year"))

#Step2: Estimate models

#1. Pooled OLS
pooled_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "pooling")
summary(pooled_model)

#2.Fixed Effect (with estimator)
fe_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "within")
summary(fe_model)

#3. random Effect

re_model <- plm(lifeExp ~ gdpPercap + pop, data = gapminder_p, model = "random")
summary(re_model)

#Step 3: Model comparison
hausman_test <- phtest(fe_model, re_model)
print(hausman_test)

Note: - Comparison of FE and RE coefficients

Fixed effects control for time-invariant characteristics.
Random effects assume no correlation with predictors.

Interpretation:

If p < 0.05, prefer Fixed Effects (RE assumptions violated).
If p ≥ 0.05, Random Effects may be more efficient.

Graphing Panel data

Basic Viz
Advanced analysis and visualization of gapminder