#Introduction
Brief introduction summarizing your learning goals and outcomes from the Statistical Modeling course.
Clearly outline how your portfolio demonstrates mastery of each learning objective.
##Demonstration of Course Learning Objectives (Show & Tell) ##Objective 1: Probability and Foundations of Statistical Modeling
##Objective 2: Generalized Linear Models Application
####SHOW - Problem Statement: Identify the number of sibilings a person has
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(ggplot2)
file_path <- "pfi-data.xlsx"
df_2016 <- read_excel(file_path, sheet = "curated 2016")
df_2019 <- read_excel(file_path, sheet = "curated 2019")
# Select numeric columns only (excluding IDs)
numeric_vars <- df_2019 %>%
select(-BASMID) %>%
select(-NUMSIBSX)
# Run PCA on numeric variables
pca_result <- prcomp(numeric_vars, scale. = TRUE)
var_explained <- pca_result$sdev^2
# Set up a larger plot with zoomed-in range
par(mar = c(5, 5, 4, 2)) # Increase margins for labels
plot(var_explained[1:20], type = "b", pch = 19, col = "darkblue",
xlab = "Principal Components",
ylab = "Variance Explained",
main = "Scree Plot (Zoomed In for Elbow Detection)",
cex.lab = 1.5, cex.axis = 1.2, cex.main = 1.6,
xaxt = "n")
axis(1, at = 1:20, labels = paste0("PC", 1:20), cex.axis = 1)
# Optional: Add elbow guideline
abline(v = 3, col = "red", lty = 2)
We can clearly see an elbow at PC3 so lets consider PC1, PC2 and PC3
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.49296 2.14912 1.90431 1.83980 1.8044 1.66710 1.5869
## Proportion of Variance 0.08514 0.06327 0.04968 0.04637 0.0446 0.03807 0.0345
## Cumulative Proportion 0.08514 0.14841 0.19808 0.24445 0.2890 0.32712 0.3616
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.52209 1.37111 1.31978 1.29128 1.23484 1.14164 1.11377
## Proportion of Variance 0.03174 0.02575 0.02386 0.02284 0.02089 0.01785 0.01699
## Cumulative Proportion 0.39335 0.41911 0.44297 0.46581 0.48670 0.50455 0.52154
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.09326 1.03899 1.01209 1.0072 1.00032 0.99071 0.97935
## Proportion of Variance 0.01637 0.01479 0.01403 0.0139 0.01371 0.01345 0.01314
## Cumulative Proportion 0.53792 0.55270 0.56674 0.5806 0.59434 0.60778 0.62092
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.96265 0.95831 0.94577 0.93016 0.92387 0.9122 0.90131
## Proportion of Variance 0.01269 0.01258 0.01225 0.01185 0.01169 0.0114 0.01113
## Cumulative Proportion 0.63362 0.64620 0.65845 0.67030 0.68200 0.6934 0.70452
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.89088 0.88607 0.88062 0.87069 0.86356 0.8543 0.85171
## Proportion of Variance 0.01087 0.01076 0.01062 0.01038 0.01022 0.0100 0.00994
## Cumulative Proportion 0.71540 0.72615 0.73677 0.74716 0.75737 0.7674 0.77731
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.84604 0.83514 0.82457 0.81563 0.80796 0.79764 0.79462
## Proportion of Variance 0.00981 0.00955 0.00931 0.00911 0.00894 0.00872 0.00865
## Cumulative Proportion 0.78711 0.79667 0.80598 0.81510 0.82404 0.83275 0.84140
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.78368 0.77646 0.76842 0.7594 0.75027 0.73745 0.72690
## Proportion of Variance 0.00841 0.00826 0.00809 0.0079 0.00771 0.00745 0.00724
## Cumulative Proportion 0.84982 0.85807 0.86616 0.8741 0.88177 0.88922 0.89646
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.71246 0.70376 0.69670 0.6780 0.67169 0.65823 0.65547
## Proportion of Variance 0.00695 0.00678 0.00665 0.0063 0.00618 0.00594 0.00589
## Cumulative Proportion 0.90342 0.91020 0.91685 0.9232 0.92933 0.93526 0.94115
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.64475 0.64055 0.61507 0.60653 0.59261 0.58543 0.54940
## Proportion of Variance 0.00569 0.00562 0.00518 0.00504 0.00481 0.00469 0.00413
## Cumulative Proportion 0.94684 0.95246 0.95765 0.96269 0.96750 0.97219 0.97633
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.54398 0.52579 0.51905 0.48912 0.4601 0.38411 0.37508
## Proportion of Variance 0.00405 0.00379 0.00369 0.00328 0.0029 0.00202 0.00193
## Cumulative Proportion 0.98038 0.98417 0.98786 0.99113 0.9940 0.99605 0.99798
## PC71 PC72 PC73
## Standard deviation 0.3526 0.15176 2.367e-15
## Proportion of Variance 0.0017 0.00032 0.000e+00
## Cumulative Proportion 0.9997 1.00000 1.000e+00
library(rsample) # for data splitting
library(dplyr)
set.seed(123) # for reproducibility
pca_scores <- as.data.frame(pca_result$x[, 1:3])
poisson_data <- cbind(pca_scores, y = df_2019$NUMSIBSX)
split <- initial_split(poisson_data, prop = 0.7)
train_data <- training(split)
test_data <- testing(split)
poisson_model <- glm(y ~ PC1 + PC2 + PC3,
data = train_data,
family = poisson(link = "log"))
# Predict on test data
predicted <- predict(poisson_model, newdata = test_data, type = "response")
# Compare predicted vs actual
results <- data.frame(
actual = test_data$y,
predicted = predicted
)
# Calculate RMSE or other metrics
rmse <- sqrt(mean((results$actual - results$predicted)^2))
rmse
## [1] 0.9148393
str(poisson_data)
## 'data.frame': 15500 obs. of 4 variables:
## $ PC1: num -0.706 -2.096 -5.04 2.827 1.856 ...
## $ PC2: num 0.412 -1.6 -1.4 -3.516 2.114 ...
## $ PC3: num 0.125 2.08 2.423 1.305 0.609 ...
## $ y : num 1 1 1 1 1 0 2 1 1 0 ...
summary(poisson_data$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.027 1.000 7.000
summary(poisson_data$predicted)
## Length Class Mode
## 0 NULL NULL
nrow(poisson_data)
## [1] 15500
poisson_data$predicted <- predict(poisson_model, newdata = poisson_data, type = "response")
poisson_plot_data <- poisson_data %>%
filter(!is.na(y), !is.na(predicted))
ggplot(poisson_plot_data, aes(x = y, y = predicted)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = paste("Poisson Regression: Actual vs Predicted NUMSIBSX (RMSE =", round(rmse, 3), ")"),
x = "Actual Number of Siblings",
y = "Predicted Number of Siblings"
) +
theme_minimal()
####Tell - “Most predicted sibling counts fall between 0.6–1.2, even
when the actual number is higher. This indicates that the top three PCA
components do not fully explain the variance in NUMSIBSX, though the
model performs reasonably well on average (RMSE = 0.915).”
What This Means The model is not overfitting, but it’s also not capturing variation well for large sibling numbers.
This is typical of Poisson models when predictors (in your case, PCA components) don’t strongly explain the outcome.
The RMSE of ~0.91 still suggests that your model’s average error is small — but mostly because the majority of students have low sibling counts (0–2).
##Objective 3: Model Selection and Comparison
Show:
Code demonstrating model selection techniques (forward, backward, stepwise selection) or cross-validation.
Example: Comparison of multiple regression models using AIC, BIC, or cross-validation methods.
Tell:
Clearly articulate your criteria for model selection and discuss how your chosen methods addressed model performance and complexity.
Reflect on the pitfalls of automatic selection procedures you encountered or avoided.
##Objective 4: Communicating Statistical Results
Show:
Provide an example of visualization or a concise summary table aimed at general audiences.
Example: Clearly labeled ggplot visualizations or regression summaries.
Tell:
Explain your rationale behind chosen visualizations or summary statistics.
Reflect on how you tailored your communication for clarity and accessibility for a non-technical audience.
##Objective 5: Programming Software Proficiency
Show:
Include well-commented Tidy Models code for a complex analysis, such as polynomial regression combined with qualitative predictors.
Tell:
Discuss your approach to code organization, reproducibility, and best practices you adopted using Tidy Models.
Explain your debugging and validation steps for ensuring accurate and reliable statistical analyses.
Specific Method Demonstrations
Ensure the following required methods are clearly demonstrated:
Multiple Regression: Quantitative and qualitative predictors, interactions, and transformations.
Multinomial Logistic Regression: Clearly articulated with multiple predictors.
Linear Discriminant Analysis or Poisson Regression: Clearly justified choice based on your dataset.
Regularization or PCA Regression: Showcase Ridge, Lasso, or PCA regression with code.
Polynomial Regression: Clearly presented with evaluation metrics.
Reflection on Course Participation
Reflect specifically on your active participation beyond assignments.
Describe your contributions to class discussions, group work, peer feedback, and any collaborative learning.
Conclusion
Summarize key learnings and how this course shaped your understanding and capabilities in statistical modeling.