Portfolio

#Introduction

Brief introduction summarizing your learning goals and outcomes from the Statistical Modeling course.

Clearly outline how your portfolio demonstrates mastery of each learning objective.

##Demonstration of Course Learning Objectives (Show & Tell) ##Objective 1: Probability and Foundations of Statistical Modeling

##Objective 2: Generalized Linear Models Application

POISSON REGRESSION

####SHOW - Problem Statement: Identify the number of sibilings a person has

Data Loading

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(ggplot2)

file_path <- "pfi-data.xlsx"

df_2016 <- read_excel(file_path, sheet = "curated 2016")
df_2019 <- read_excel(file_path, sheet = "curated 2019")

Data Preprocessing - Principal Component Ananlysis - PCA helps to simplify complex datasets with many variables by reducing the number of dimensions while minimizing information loss

# Select numeric columns only (excluding IDs)
numeric_vars <- df_2019 %>%
  select(-BASMID) %>%
  select(-NUMSIBSX)

# Run PCA on numeric variables
pca_result <- prcomp(numeric_vars, scale. = TRUE)

var_explained <- pca_result$sdev^2


# Set up a larger plot with zoomed-in range
par(mar = c(5, 5, 4, 2))  # Increase margins for labels
plot(var_explained[1:20], type = "b", pch = 19, col = "darkblue",
     xlab = "Principal Components",
     ylab = "Variance Explained",
     main = "Scree Plot (Zoomed In for Elbow Detection)",
     cex.lab = 1.5, cex.axis = 1.2, cex.main = 1.6,
     xaxt = "n")
axis(1, at = 1:20, labels = paste0("PC", 1:20), cex.axis = 1)

# Optional: Add elbow guideline
abline(v = 3, col = "red", lty = 2)

We can clearly see an elbow at PC3 so lets consider PC1, PC2 and PC3

summary(pca_result)

## Importance of components:
##                            PC1     PC2     PC3     PC4    PC5     PC6    PC7
## Standard deviation     2.49296 2.14912 1.90431 1.83980 1.8044 1.66710 1.5869
## Proportion of Variance 0.08514 0.06327 0.04968 0.04637 0.0446 0.03807 0.0345
## Cumulative Proportion  0.08514 0.14841 0.19808 0.24445 0.2890 0.32712 0.3616
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.52209 1.37111 1.31978 1.29128 1.23484 1.14164 1.11377
## Proportion of Variance 0.03174 0.02575 0.02386 0.02284 0.02089 0.01785 0.01699
## Cumulative Proportion  0.39335 0.41911 0.44297 0.46581 0.48670 0.50455 0.52154
##                           PC15    PC16    PC17   PC18    PC19    PC20    PC21
## Standard deviation     1.09326 1.03899 1.01209 1.0072 1.00032 0.99071 0.97935
## Proportion of Variance 0.01637 0.01479 0.01403 0.0139 0.01371 0.01345 0.01314
## Cumulative Proportion  0.53792 0.55270 0.56674 0.5806 0.59434 0.60778 0.62092
##                           PC22    PC23    PC24    PC25    PC26   PC27    PC28
## Standard deviation     0.96265 0.95831 0.94577 0.93016 0.92387 0.9122 0.90131
## Proportion of Variance 0.01269 0.01258 0.01225 0.01185 0.01169 0.0114 0.01113
## Cumulative Proportion  0.63362 0.64620 0.65845 0.67030 0.68200 0.6934 0.70452
##                           PC29    PC30    PC31    PC32    PC33   PC34    PC35
## Standard deviation     0.89088 0.88607 0.88062 0.87069 0.86356 0.8543 0.85171
## Proportion of Variance 0.01087 0.01076 0.01062 0.01038 0.01022 0.0100 0.00994
## Cumulative Proportion  0.71540 0.72615 0.73677 0.74716 0.75737 0.7674 0.77731
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.84604 0.83514 0.82457 0.81563 0.80796 0.79764 0.79462
## Proportion of Variance 0.00981 0.00955 0.00931 0.00911 0.00894 0.00872 0.00865
## Cumulative Proportion  0.78711 0.79667 0.80598 0.81510 0.82404 0.83275 0.84140
##                           PC43    PC44    PC45   PC46    PC47    PC48    PC49
## Standard deviation     0.78368 0.77646 0.76842 0.7594 0.75027 0.73745 0.72690
## Proportion of Variance 0.00841 0.00826 0.00809 0.0079 0.00771 0.00745 0.00724
## Cumulative Proportion  0.84982 0.85807 0.86616 0.8741 0.88177 0.88922 0.89646
##                           PC50    PC51    PC52   PC53    PC54    PC55    PC56
## Standard deviation     0.71246 0.70376 0.69670 0.6780 0.67169 0.65823 0.65547
## Proportion of Variance 0.00695 0.00678 0.00665 0.0063 0.00618 0.00594 0.00589
## Cumulative Proportion  0.90342 0.91020 0.91685 0.9232 0.92933 0.93526 0.94115
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.64475 0.64055 0.61507 0.60653 0.59261 0.58543 0.54940
## Proportion of Variance 0.00569 0.00562 0.00518 0.00504 0.00481 0.00469 0.00413
## Cumulative Proportion  0.94684 0.95246 0.95765 0.96269 0.96750 0.97219 0.97633
##                           PC64    PC65    PC66    PC67   PC68    PC69    PC70
## Standard deviation     0.54398 0.52579 0.51905 0.48912 0.4601 0.38411 0.37508
## Proportion of Variance 0.00405 0.00379 0.00369 0.00328 0.0029 0.00202 0.00193
## Cumulative Proportion  0.98038 0.98417 0.98786 0.99113 0.9940 0.99605 0.99798
##                          PC71    PC72      PC73
## Standard deviation     0.3526 0.15176 2.367e-15
## Proportion of Variance 0.0017 0.00032 0.000e+00
## Cumulative Proportion  0.9997 1.00000 1.000e+00

library(rsample)  # for data splitting
library(dplyr)

set.seed(123)  # for reproducibility

pca_scores <- as.data.frame(pca_result$x[, 1:3])
poisson_data <- cbind(pca_scores, y = df_2019$NUMSIBSX)


split <- initial_split(poisson_data, prop = 0.7)
train_data <- training(split)
test_data  <- testing(split)

poisson_model <- glm(y ~ PC1 + PC2 + PC3, 
                     data = train_data, 
                     family = poisson(link = "log"))

# Predict on test data
predicted <- predict(poisson_model, newdata = test_data, type = "response")

# Compare predicted vs actual
results <- data.frame(
  actual = test_data$y,
  predicted = predicted
)

# Calculate RMSE or other metrics
rmse <- sqrt(mean((results$actual - results$predicted)^2))
rmse

## [1] 0.9148393

str(poisson_data)

## 'data.frame':    15500 obs. of  4 variables:
##  $ PC1: num  -0.706 -2.096 -5.04 2.827 1.856 ...
##  $ PC2: num  0.412 -1.6 -1.4 -3.516 2.114 ...
##  $ PC3: num  0.125 2.08 2.423 1.305 0.609 ...
##  $ y  : num  1 1 1 1 1 0 2 1 1 0 ...

summary(poisson_data$y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.027   1.000   7.000

summary(poisson_data$predicted)

## Length  Class   Mode 
##      0   NULL   NULL

nrow(poisson_data)

## [1] 15500

poisson_data$predicted <- predict(poisson_model, newdata = poisson_data, type = "response")

poisson_plot_data <- poisson_data %>%
  filter(!is.na(y), !is.na(predicted))

ggplot(poisson_plot_data, aes(x = y, y = predicted)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = paste("Poisson Regression: Actual vs Predicted NUMSIBSX (RMSE =", round(rmse, 3), ")"),
    x = "Actual Number of Siblings",
    y = "Predicted Number of Siblings"
  ) +
  theme_minimal()

####Tell - “Most predicted sibling counts fall between 0.6–1.2, even when the actual number is higher. This indicates that the top three PCA components do not fully explain the variance in NUMSIBSX, though the model performs reasonably well on average (RMSE = 0.915).”

What This Means The model is not overfitting, but it’s also not capturing variation well for large sibling numbers.

This is typical of Poisson models when predictors (in your case, PCA components) don’t strongly explain the outcome.

The RMSE of ~0.91 still suggests that your model’s average error is small — but mostly because the majority of students have low sibling counts (0–2).

##Objective 3: Model Selection and Comparison

Show:

Code demonstrating model selection techniques (forward, backward, stepwise selection) or cross-validation.

Example: Comparison of multiple regression models using AIC, BIC, or cross-validation methods.

Tell:

Clearly articulate your criteria for model selection and discuss how your chosen methods addressed model performance and complexity.

Reflect on the pitfalls of automatic selection procedures you encountered or avoided.

##Objective 4: Communicating Statistical Results

Show:

Provide an example of visualization or a concise summary table aimed at general audiences.

Example: Clearly labeled ggplot visualizations or regression summaries.

Tell:

Explain your rationale behind chosen visualizations or summary statistics.

Reflect on how you tailored your communication for clarity and accessibility for a non-technical audience.

##Objective 5: Programming Software Proficiency

Show:

Include well-commented Tidy Models code for a complex analysis, such as polynomial regression combined with qualitative predictors.

Tell:

Discuss your approach to code organization, reproducibility, and best practices you adopted using Tidy Models.

Explain your debugging and validation steps for ensuring accurate and reliable statistical analyses.

Specific Method Demonstrations

Ensure the following required methods are clearly demonstrated:

Multiple Regression: Quantitative and qualitative predictors, interactions, and transformations.

Multinomial Logistic Regression: Clearly articulated with multiple predictors.

Linear Discriminant Analysis or Poisson Regression: Clearly justified choice based on your dataset.

Regularization or PCA Regression: Showcase Ridge, Lasso, or PCA regression with code.

Polynomial Regression: Clearly presented with evaluation metrics.

Reflection on Course Participation

Reflect specifically on your active participation beyond assignments.

Describe your contributions to class discussions, group work, peer feedback, and any collaborative learning.

Conclusion

Summarize key learnings and how this course shaped your understanding and capabilities in statistical modeling.

Portfolio

Praharsha Bejju

2025-04-05

POISSON REGRESSION

Data Loading

Data Preprocessing - Principal Component Ananlysis - PCA helps to simplify complex datasets with many variables by reducing the number of dimensions while minimizing information loss