STA 631 · Statistical Modeling I · Winter 2026

Hello, I'm Udita Bista

I'm a graduate student pursuing a Master's in Data Science and Analytics at Grand Valley State University, with a background in Computer Engineering. This portfolio brings together the modeling work I completed in Statistical Modeling I - a journey through regression, classification, regularization, and model validation, all built with R and Tidymodels. Throughout the semester, I worked extensively with R to analyze real-world datasets and build statistical models, moving from simple ideas to more complex techniques.

Section 01

Introduction

Name: Udita Bista

Program: M.S. Data Science & Analytics

University: Grand Valley State University

Course: STA 631 (Statistical Modeling I), Winter 2026

I came into this course with a curiosity about how data can be used to explain and predict real-world phenomena. Through STA 631, I developed a structured approach to building, evaluating, and communicating statistical models. Each project in this portfolio reflects not just technical execution, but an effort to understand the story behind the data - why a model behaves the way it does, what its limitations are, and how its insights can be applied responsibly. My goal throughout was not simply to fit models, but to understand the reasoning behind each methodological choice and to communicate findings in a way that is clear, honest, and actionable.

What this course covered

STA 631 spanned the core pillars of supervised statistical learning - from the foundations of linear regression through to regularization and resampling methods. Each topic built deliberately on the last, providing both theoretical grounding and hands-on modeling practice using real datasets in R.

Simple linear regression with model diagnostics
Multiple linear regression (continuous & categorical)
Interaction terms in regression models
Detecting and handling multicollinearity
Train / test splitting for model evaluation
Logistic regression for binary classification
Multinomial logistic regression
Poisson & Negative Binomial regression
Linear & Quadratic Discriminant Analysis
Best subset & polynomial regression
Resampling methods & cross-validation

Course Learning Objectives

The following objectives guided my development throughout the semester. Each project presented in this portfolio serves as a milestone in mastering these core competencies:

1. Probability & Inference

Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation.

2. Generalized Linear Models

Apply the appropriate generalized linear model for a specific data context.

3. Model Selection

Demonstrate model selection given a set of candidate models.

4. Statistical Communication

Express the results of statistical models to a general audience.

5. Technical Proficiency

Use programming software to fit and assess statistical models.

Tools I used

RStudio, R Markdown, Tidymodels, ISLR, ggplot2, dplyr / tidyr, recipes, rsample, RPubs

Section 02

Key skills gained

Beyond learning individual methods, this course helped me build a coherent end-to-end workflow for statistical modeling - from data preparation and feature engineering through to model tuning, validation, and interpretation. The skills below represent both technical competencies and the analytical mindset I developed over the semester.

◈

Regression modeling

Building and interpreting simple, multiple, interaction, and polynomial regression models with full diagnostic checks for assumption violations.

◈

Classification methods

Applying logistic, multinomial, and discriminant analysis models (LDA/QDA); evaluating them with ROC curves, AUC, and confusion matrices.

◈

Model selection

Using best subset selection, backward elimination, VIF analysis, AIC/BIC comparisons, and validation sets to choose among candidate models in a principled way.

◈

Model validation

Implementing train/test splits, three-way validation splits, and resampling to produce honest estimates of out-of-sample performance.

◈

Count data modeling

Applying Poisson and Negative Binomial regression to count outcomes, diagnosing overdispersion, and choosing the appropriate distributional family.

◈

Statistical communication

Translating coefficient estimates, odds ratios, and model metrics into plain-language findings accessible to non-technical audiences.

Section 03

Featured projects

Each project below was completed as part of STA 631. Together they span the full supervised modeling workflow - from exploratory analysis and feature engineering through to validation and interpretation. Click "View Report" to open the published report, which contains the exploratory analysis, the code with the reasoning behind each step, the interpretation of the results, and the conclusions drawn from them, or download the source .Rmd file to run the analysis yourself.

Regression · Model Diagnostics

Simple Linear Regression

This project investigates the relationship between vehicle horsepower and fuel efficiency (mpg) using the Auto dataset. I established a statistically significant negative relationship, quantifying how much fuel efficiency drops for every unit increase in horsepower. The analysis also compares model flexibility by evaluating simple linear regression against more complex cubic models, highlighting the trade-offs between training fit and overfitting risk. The study then extends to the Boston housing dataset to assess model stability and predictive power on an unseen test set using metrics such as R², Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). Model assumptions (linearity, homoscedasticity, and normality) are checked through residual plots, Q-Q plots, and leverage diagnostics to ensure the validity of inference for both models.
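As a rough illustration of the three test-set metrics named above, here is a minimal sketch in plain Python rather than the R code used in the report. The function name `regression_metrics` and the four data points are hypothetical and not taken from the Boston analysis. R² is computed in its traditional form, 1 - SS_res / SS_tot; some packages report the squared correlation instead, which can differ on held-out data.

```python
import math

def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for paired lists of observed and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    # RMSE penalises large errors more heavily than MAE does.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Traditional R^2: share of variation around the mean that the model explains.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical held-out values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Reporting RMSE and MAE side by side is useful because a large gap between them points to a few large errors rather than uniformly moderate ones.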

Regression · Interaction Terms · Feature Engineering

Multiple Regression with Interaction Terms

Using the Carseats dataset, this project builds a multiple linear regression model predicting unit sales using both a quantitative predictor (Price) and a qualitative predictor (ShelveLoc), along with their interaction. I used ggpairs visualizations to justify the choice of both the qualitative and the quantitative predictor, fit the model with an 80/20 train/test split, and interpreted three distinct regression lines - one for each shelf location category. The analysis demonstrates how shelf location moderates the relationship between price and sales, produces 95% confidence and prediction intervals for new observations, and validates assumptions through residual plots, histograms, and Q-Q plots. Training R² was 0.57 and test R² was 0.43, revealing modest overfitting and suggesting that adding further predictors could improve model performance.

Regression · Backward Elimination · Multicollinearity

Backward Elimination & VIF Analysis

This project applies manual backward elimination by p-value to the Auto dataset, predicting vehicle acceleration from all available quantitative predictors. Starting from the full model, I removed mpg, cylinders, and year sequentially - tracking adjusted R² at each step to confirm that parsimony improved model quality. After reaching a model containing only significant predictors, Variance Inflation Factor (VIF) analysis revealed severe multicollinearity among displacement, horsepower, and weight (VIFs of 10 or more), prompting a second round of elimination that removed displacement, the predictor with the highest VIF. The final model predicts acceleration from horsepower and weight, with full interpretation of coefficients and 95% confidence intervals, plus residual diagnostics showing mild heteroscedasticity and mild non-linearity.
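For readers unfamiliar with the VIF, the standard textbook definition is sketched below. The symbol R_j² is notation introduced here, not taken from the report: it is the R² obtained by regressing predictor j on all the other predictors.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}},
\qquad\text{so}\qquad
\mathrm{VIF}_j = 10 \;\Longleftrightarrow\; R_j^{2} = 0.9 .
\]
```

In other words, a VIF of 10 means that 90% of the variation in that predictor is already explained by the other predictors, which is why its coefficient becomes unstable.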

Classification · Multinomial Regression

Multinomial Logistic Regression - Voter Behavior

Using the FiveThirtyEight nonvoters dataset (n ≈ 5,800), this project applies multinomial logistic regression to predict three-class voter behavior - always, sporadic, and rarely/never - from demographic and party affiliation predictors. Because the outcome has three classes, the model has two equations, one for always voters and one for sporadic voters, with rarely/never treated as the baseline category. I write out both log-odds equations with fully labelled coefficients, interpret the gender and party slopes in odds-ratio terms (e.g., Democrats have 305.9% higher odds of being always vs rarely/never voters than the reference "Other" group), and produce a detailed confusion matrix showing the model's central-category bias. The project also compares model performance with and without party affiliation and uses repair_call() to ensure augment() produces valid predictions.

Regression · Best Subset · Polynomial · Group Project

Predicting Diamond Carat from Visual Characteristics

This group project investigates whether a diamond's carat weight can be estimated from visually observable characteristics alone - without physically weighing it. Using the diamonds dataset (53,940 observations), we applied best subset selection across 25 candidate predictors (including squared dimension terms) with a three-way train/validation/test split. The validation set objectively selected a 5-predictor model: table, poly(x, 2), and poly(z, 2). Orthogonal polynomials resolved the collinearity between raw and squared terms (confirmed by VIF). The final model achieved an R² of 0.997 and test RMSE of 0.026 carats - demonstrating that physical dimensions alone predict carat weight with near-perfect accuracy, while quality grades (cut, color, clarity) add no predictive value.

Count Data · Classification · LDA/QDA · Group Project

Predicting Educational Success - K-12 PFI Survey

This group project analyzes data from the 2019 Parent and Family Involvement (PFI) in Education Survey, collected by the US Census Bureau for the Department of Education. The dataset contains 15,500 observations, each representing one K-12 student, with variables covering school type, family engagement behaviors, child academic outcomes, and household demographics. For the count outcome, we diagnosed severe overdispersion (variance/mean ratio = 12.2) and used Negative Binomial regression instead of Poisson, achieving an AIC improvement of ~34,944 points. For classification, we compared logistic regression (AUC = 0.743), multinomial regression with upsampling via themis to address class imbalance (31:1 ratio), and LDA/QDA restricted to continuous predictors (AUC ≈ 0.651). Logistic regression was the strongest classifier; disability status and parent education were the dominant predictors across all models.

Section 04

Reflection

The five reflections below each address one course learning objective, showing a specific piece of work and explaining how it demonstrates that objective. A final reflection describes my participation in the course community.

Objective 1 Probability & Inference - Project 01: Simple Linear Regression

In Project 01 (Simple Linear Regression), I demonstrate Objective 1 by treating probability and inference as the backbone of every analytical decision, not just a final reporting step. The core of this objective is the understanding that statistical modeling is fundamentally about quantifying uncertainty - and this project shows that in practice. For the Auto dataset, I used p-values to formally test whether the relationship between horsepower and mpg was statistically distinguishable from chance, arriving at a p-value well below 0.05 and concluding the relationship is real. More importantly, I went beyond the point estimate by computing 95% confidence intervals for the mean response and 95% prediction intervals for individual observations, making explicit the difference between uncertainty about the average car and uncertainty about any specific car - a distinction that sits at the heart of inferential reasoning.
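The difference between the two intervals can be made concrete with the standard textbook formulas for simple linear regression at a new predictor value x₀. The notation here is an assumption of this sketch, not taken from the report: s is the residual standard error, n the sample size, and S_xx the sum of squared deviations of x from its mean.

```latex
\[
\text{Confidence interval (mean response):}\quad
\hat{y}_0 \;\pm\; t_{n-2,\,0.975}\; s\,
\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^{2}}{S_{xx}}}
\]
\[
\text{Prediction interval (single observation):}\quad
\hat{y}_0 \;\pm\; t_{n-2,\,0.975}\; s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^{2}}{S_{xx}}}
\]
```

The extra 1 under the square root is the variance of an individual car around the regression line, which is why the prediction interval is always wider and does not shrink to zero as n grows.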

The Boston housing analysis extended this inferential framework to a train/test validation context, which connects probability directly to predictive modeling. I used the fitted model to generate predictions on held-out data and quantified uncertainty around those predictions through both interval estimates and performance metrics (RMSE, R²). The bias-variance trade-off comparison between linear and cubic models also demonstrates probabilistic thinking: a cubic model lowers training error but inflates variance, meaning its predictions are less reliable in expectation on new data. Understanding that a model's performance on training data is an optimistically biased estimate of its true generalization performance is itself a probabilistic insight - and one that I applied concretely by selecting the simpler model on principled grounds rather than just chasing a lower RSS.

Maximum likelihood estimation (MLE), which coincides with least squares in linear regression when the errors are assumed to be normally distributed, also informed how I interpreted my diagnostic plots. The Q-Q plot is a direct visual check of the normality assumption behind that likelihood; the residual-vs-fitted plot checks homoscedasticity, which affects the validity of standard errors and therefore all downstream inference. By working through these diagnostics and explaining what each violation would mean for the reliability of my p-values and intervals, I demonstrated that inference is not a mechanical output but a conclusion that is only valid when the probabilistic assumptions behind the model are adequately met. This disciplined approach to assumption-checking is what makes the inferential claims in the report trustworthy rather than merely reported.

Objective 2 Generalized Linear Models - Projects 02, 04 & 06

Objective 2 asks me to apply the appropriate generalized linear model for a specific data context, and across three projects I worked with three different GLM families that each demanded a different distributional assumption. In Project 02 (Carseats), I applied multiple linear regression with both a quantitative predictor (Price) and a qualitative predictor (ShelveLoc), along with their interaction. The inclusion of an interaction term meant the model produced three distinct regression lines - one per shelf location category - and interpreting those lines required understanding not just the math but what the conditional slopes actually mean: for Good shelf locations, each one-dollar increase in price reduces predicted sales by 0.063 units, versus only 0.050 for Bad locations. Choosing this model over a simpler additive one was itself a modeling decision grounded in the data, as my ggpairs exploration showed that shelf location visibly separated the data into distinct groups before I fit anything.

In Project 04 (Nonvoters), I applied multinomial logistic regression to a three-class outcome - always, sporadic, and rarely/never voters - from a set of demographic and party affiliation predictors. Selecting multinomial regression here was the appropriate choice because the outcome had more than two unordered categories, and binary logistic regression would have thrown away information. I wrote out both log-odds equations in full, converted those equations into odds-ratio statements (e.g., Democrats have 305.9% higher odds of being always vs rarely/never voters than the "Other" reference group), and compared model performance with and without party affiliation. In Project 06 (K-12 PFI), I applied Negative Binomial regression to a count outcome - the number of times parents participated in school activities - after diagnosing severe overdispersion (variance-to-mean ratio = 12.2). Standard Poisson regression assumes the variance equals the mean; when that assumption is violated, standard errors are underestimated and hypothesis tests become invalid. Switching to Negative Binomial and quantifying the AIC improvement of ~34,944 points demonstrated that choosing the right distributional family is not a formality - it is a substantive modeling decision with real consequences for the reliability of inference.
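The overdispersion check described above can be sketched in a few lines. This is a plain-Python illustration with hypothetical counts, not the PFI data or the R code from the report, and the helper name `dispersion_ratio` is introduced here. It uses the raw sample variance divided by the sample mean, which is one simple way to screen for overdispersion; the report may have computed its ratio differently.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)

# Hypothetical participation counts, for illustration only.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21]
ratio = dispersion_ratio(counts)

# A ratio well above 1 suggests Poisson standard errors would be too small
# and that a Negative Binomial model is worth fitting and comparing.
print(f"variance/mean = {ratio:.2f}")
```

A ratio this far above 1 would normally be followed by fitting both models and comparing them, for example by AIC, as the project did.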

Across all three projects, what tied these choices together was the same underlying discipline: before selecting a model, I examined the data structure and the nature of the outcome, then matched the model family to those properties. That diagnostic-first mindset - rather than defaulting to the most familiar method - is what I take as the core of Objective 2, and it is a habit I will carry into every analysis I do going forward.

Objective 3 Model Selection - Projects 03 & 05

Objective 3 requires me to demonstrate model selection given a set of candidate models, and the two projects that best illustrate this are Project 03 (Auto backward elimination) and Project 05 (Diamonds best subset). In Project 03, I performed two sequential rounds of elimination on the Auto dataset: first, a manual backward elimination by p-value that removed mpg, cylinders, and year one at a time, tracking adjusted R² at each step; second, a VIF-based elimination that removed displacement due to severe collinearity with the remaining predictors. The final model - predicting acceleration from horsepower and weight - was arrived at through a documented, step-by-step process with a clear rationale at every stage. The most important lesson here was that eliminating a predictor because it is collinear, even if it is individually significant, is a legitimate and necessary modeling decision when the goal is producing stable, interpretable coefficient estimates rather than maximizing R².

In Project 05 (Diamonds), our group applied best subset selection across 25 candidate predictors using regsubsets(), then used a held-out validation set - not training-set criteria - to objectively choose the final model size. This three-way split (64% training, 16% validation, 20% test) meant that the model size was chosen on data the candidate models had not been fitted to, and the test set was kept untouched for the final evaluation. The validation set selected a 5-predictor polynomial model - table, poly(x, 2), and poly(z, 2) - and the held-out test R² of 0.997 confirmed that the selection procedure generalized well. An interesting finding was the sharp spike in validation MSE when model size jumped from 7 to 8 predictors, which I traced back to the instability introduced by adding a partially included categorical variable (cut) alongside already-collinear dimension predictors. That investigation taught me that validation error is not monotone in model size, and a larger model is not always a better one - a principle that applies far beyond this dataset.
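One way to arrive at a 64/16/20 split is to hold out 20% for testing and then hold out 20% of the remainder for validation (0.8 × 0.8 = 0.64 and 0.8 × 0.2 = 0.16). The sketch below illustrates that arithmetic in plain Python. The function name, the seed, and the row count of 1,000 are hypothetical, and the report itself may have built the split differently in R.

```python
import random

def three_way_split(n_rows, test_frac=0.20, val_frac_of_rest=0.20, seed=631):
    """Shuffle row indices and cut them into train, validation, and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_rows * test_frac)
    n_val = round((n_rows - n_test) * val_frac_of_rest)
    test = indices[:n_test]
    validation = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, validation, test

# Hypothetical row count, for illustration only.
train, validation, test = three_way_split(1000)
print(len(train), len(validation), len(test))  # 640 160 200
```

Candidate models are fitted on the training rows, compared on the validation rows, and the single chosen model is scored once on the test rows.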

What both projects share is a commitment to selecting models by criteria that guard against overfitting rather than by raw training fit. Whether through adjusted R², VIF thresholds, or validation-set MSE, the underlying principle is the same: a model's value is measured by how well it works on new data, not on the sample used to fit it. Applying this principle consistently is what turns model selection from a mechanical procedure into a form of genuine statistical reasoning.

Objective 4 Statistical Communication - Projects 04 & 06

Objective 4 asks me to express the results of statistical models to a general audience, and this is where I feel I grew the most over the course of the semester. My early analyses tended to report coefficients and metrics without translation - a slope of -0.05, an R² of 0.57 - and leave the interpretation to the reader. By Projects 04 and 06, I had developed a consistent practice of converting every model output into a plain-language statement before moving on. In Project 04 (Nonvoters), instead of reporting that the Democrat party indicator had a coefficient of 1.401 in the log-odds equation, I computed the exponentiated odds ratio and reported that Democrats have 305.9% higher odds of being always voters compared to rarely/never voters relative to the "Other" reference group. For the gender slope, I translated -0.192 log-odds into a 17.47% reduction in the odds of being an always voter - a number a policymaker or journalist could actually use. I also explained the confusion matrix in plain terms, naming the model's central-category bias and describing why sporadic was over-predicted structurally, not just as an observed error rate.
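The translation from a log-odds coefficient to a percentage change in odds is a one-line calculation: exponentiate the coefficient, subtract 1, and multiply by 100. The sketch below reproduces the two figures quoted above from their coefficients. It is written in plain Python for illustration, and the helper name `odds_percent_change` is introduced here rather than taken from the report.

```python
import math

def odds_percent_change(log_odds_coefficient):
    """Convert a log-odds coefficient into a percentage change in the odds."""
    return (math.exp(log_odds_coefficient) - 1) * 100

# Coefficients as quoted in the reflection above.
democrat_change = odds_percent_change(1.401)   # about +305.9
gender_change = odds_percent_change(-0.192)    # about -17.47

print(f"Democrat vs Other: {democrat_change:+.1f}% change in odds")
print(f"Gender slope:      {gender_change:+.2f}% change in odds")
```

A positive coefficient always maps to an increase in odds and a negative one to a decrease, but the percentage is not symmetric: a coefficient of 1.401 roughly quadruples the odds, while -1.401 would cut them by about 75%.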

In Project 06 (K-12 PFI), the communication challenge was larger because we had multiple models and multiple outcome types running in parallel. For the Negative Binomial model, I reported that each additional school activity a parent participated in multiplied their expected participation count by 1.29 - framing the result multiplicatively because that is how Negative Binomial coefficients work on the original scale, and because a multiplication factor is more intuitive than a log-count change for a non-statistical audience. For the logistic regression, I used plain English to describe the magnitude and direction of every significant predictor: children without a disability have 65% lower odds of being a non-high-achiever; girls have 42% lower odds than boys; each step up in household income is associated with slightly higher odds of academic achievement. These are findings with real meaning for school administrators and policymakers, and communicating them honestly required thinking carefully about not just what to say but how to say it without overstating certainty or hiding limitations.

The skill I developed through this objective goes beyond writing. Learning to translate statistical output into accessible language also forced me to check whether I actually understood what I was reporting. If I could not explain a coefficient in words without jargon, it usually meant I had not fully internalized what the model was doing. In that sense, Objective 4 functioned as a diagnostic for all the other objectives - a way of catching gaps in understanding that numerical outputs alone would not surface.

Objective 5 Technical Proficiency - Projects 05 & 06

Objective 5 requires me to use programming software to fit and assess statistical models, and the two projects that best showcase this proficiency are Projects 05 and 06. In Project 05 (Diamonds), I used regsubsets() from the leaps package for exhaustive best subset selection across 25 candidate predictors - a procedure that would be computationally impractical to do manually. I then built a validation-set loop to compute prediction error for every model size, identified the best size programmatically, refitted the final model using poly() for orthogonal polynomial encoding (rather than raw squared terms), and confirmed via vif() that collinearity was resolved. The full pipeline - data cleaning, three-way splitting with set.seed() for reproducibility, model selection, diagnostics, and held-out test evaluation - was written in R Markdown with code and narrative interwoven throughout, producing a document that is both a working analysis and a readable report. Working with 53,940 observations also required attention to code efficiency that smaller datasets do not demand.
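To see why an orthogonal encoding removes the collinearity between a dimension and its square, the following sketch builds a degree-2 orthogonal term by hand and compares correlations. It is plain Python with a hypothetical, evenly spaced predictor, not the diamonds data. R's poly() also rescales its columns to unit length, which this sketch omits because scaling does not affect correlation.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def orthogonal_quadratic(x):
    """Return linear and quadratic terms that are orthogonal to each other."""
    n = len(x)
    mean_x = sum(x) / n
    linear = [v - mean_x for v in x]             # centred, so orthogonal to the intercept
    squares = [v ** 2 for v in x]
    mean_sq = sum(squares) / n
    centred_sq = [s - mean_sq for s in squares]
    # Remove the part of x^2 that the linear term already explains.
    slope = sum(l * s for l, s in zip(linear, centred_sq)) / sum(l * l for l in linear)
    quadratic = [s - slope * l for s, l in zip(centred_sq, linear)]
    return linear, quadratic

# Hypothetical dimension values in millimetres, for illustration only.
x = [3.5 + 0.25 * i for i in range(20)]
raw_corr = pearson(x, [v ** 2 for v in x])
linear, quadratic = orthogonal_quadratic(x)
orth_corr = pearson(linear, quadratic)

print(f"raw x vs x^2 correlation:      {raw_corr:.4f}")
print(f"orthogonal terms correlation:  {orth_corr:.1e}")
```

The raw terms are almost perfectly correlated because x only takes positive values over a narrow range, while the orthogonalized terms are uncorrelated up to floating-point error, which is what drives the VIF back toward 1.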

In Project 06 (K-12 PFI), the technical scope was even broader. I worked with a real government survey dataset containing 75 variables and meaningful missing-value codes (negative integers as valid skips), which required careful recoding before any analysis could begin. The modeling workflow spanned four different model types within a single document: glm.nb() for Negative Binomial regression, logistic_reg() with the glm engine, multinom_reg() with the nnet engine (including repair_call() to fix tidymodels' internal call storage), and discrim_linear() / discrim_quad() with the MASS engine for LDA and QDA. Across all models, I used an 80/20 stratified train/test split, computed AIC comparisons and likelihood ratio tests for model selection, and evaluated final performance using metric_set(), conf_mat(), and roc_curve(). The upsampling for the multinomial model was implemented through a full tidymodels recipe using step_upsample() from themis - applied only to training data via prep() and bake() - ensuring that the test set remained in its natural distribution for an honest evaluation.

What these two projects demonstrate together is not just that I can run the functions, but that I understand when to use which tool and why. Choosing poly() over raw squared terms, stratifying the split by the outcome variable, repairing the model call before augmenting - each of these decisions required knowing something about how R and tidymodels work internally, not just what the surface-level syntax looks like. That depth of technical understanding is what I mean by proficiency, and it is the foundation on which all the other objectives in this portfolio ultimately rest.

Community Participation in the Course Learning Community

Throughout STA 631, I made a point of attending every class and ensuring I genuinely understood the material being delivered rather than simply being present. When something was unclear, I asked questions in class and followed up with internet research on my own time to fill in any gaps. I found that taking this extra step - going beyond what was covered in the session - helped me build a more solid foundation before moving on to the next topic.

The assignments and projects were especially valuable in reinforcing that understanding. Each one gave me an opportunity to revisit concepts from class and apply them to real data, which revealed gaps in my comprehension that lectures alone had not surfaced. When I ran into confusion, I worked through it by discussing with classmates, and when I had a clearer grasp of something a peer was struggling with, I offered an explanation in return. That back-and-forth was one of the most effective parts of my learning process throughout the semester.

Two projects in this portfolio - Projects 05 and 06 - were completed in groups. Collaborating on those analyses pushed me to articulate methodological decisions clearly to teammates: why we removed zero-dimension diamonds, why we used a three-way split, why we switched from Poisson to Negative Binomial. Having to justify those choices to collaborators, not just to an instructor, held me to a higher standard of clarity that I carried back into my individual work as well.

Section 05

Self-Evaluation Letter

A letter written in response to my Letter of Commitment from the beginning of the semester, reflecting on the goals I set, the progress I made, and the challenges I encountered along the way.

Udita Bista · Grand Rapids, MI 49503

(616) 334-0874 · bistau@mail.gvsu.edu

April 25, 2026

John Appiah Kubi

Assistant Professor, Department of Statistics

Mackinac Hall A-1-178, Allendale, Michigan 49401

Dear Professor John,

At the start of this semester, I wrote a Letter of Commitment outlining three learning goals for STA 631: writing cleaner and more readable statistical code, developing a deeper understanding of the computational ideas behind statistical methods, and improving my consistency in time management and independent problem-solving. Now that the semester is over, I can say that I met each of these goals to some extent, and in some areas I did better than I thought I would.

My first goal was to write statistical code that is not just functional but clean, readable, and reproducible. I am genuinely proud of the progress I made here. I worked consistently within R Markdown for every assignment and project, blending my code and explanations together so each document functions as a complete, clear analysis instead of just a series of separate results. I learned to use the Tidymodels framework with increasing fluency, creating modular and easy-to-follow workflows that run from exploratory data analysis and train/test splitting through model building to validating the model's fit. Working with my teammates on group projects made me more careful about how I wrote my code and how I explained it. What I did not expect was that writing code would also help me understand the concepts better. When I could not explain something clearly, it meant I did not fully understand it yet. This habit also helped me organize my thoughts and document them for the future.

My second goal was to understand what actually happens inside statistical procedures - the computational reasoning behind the methods, not just their application. This is the area where I feel I grew the most unexpectedly. Working through backward elimination step by step, tracking adjusted R² at each stage, and then discovering that statistically significant predictors still needed to be removed because of multicollinearity gave me a much more concrete sense of how model-fitting decisions interact with each other. Similarly, diagnosing overdispersion in the K-12 PFI project and understanding why violating the Poisson equidispersion assumption invalidates standard errors - rather than just knowing that Negative Binomial is an alternative - was exactly the kind of deeper reasoning I had set out to develop. I also did not anticipate how much the polynomial regression project would teach me about the mechanics of collinearity: using orthogonal polynomials via poly() rather than raw squared terms, and then verifying the fix through VIF diagnostics, was a level of computational understanding I did not have at the start of the semester.

My third goal was time management and independent problem solving, and this is where I was most honest with myself about what remained difficult. I did use disciplined scheduling to stay on top of deadlines, and I made an effort to solve difficulties on my own before asking for help. However, there were times when I underestimated the time required, especially during the more technically demanding assignments, and I felt the pressure of competing courses and personal commitments as an international student juggling a full graduate workload. What made a substantial difference was Professor John's responsiveness to the needs of the class. When it was evident that additional time was needed to engage with the material properly, he listened and extended deadlines rather than holding rigidly to the original schedule. That flexibility was not something I anticipated, and it genuinely changed how I approached the work - instead of rushing to submit something merely complete, I had the space to go deeper, revisit my reasoning, and produce analyses I was actually confident in. It taught me that asking for time when you need it is not a sign of poor planning but of knowing what good work requires.

Beyond my original three goals, I achieved things I did not anticipate when I wrote my commitment letter. I did not expect to work on group projects of the scale of the diamonds and K-12 PFI analyses, and collaborating on those pushed me to develop skills in communicating methodological decisions to peers - a form of statistical communication that is different from writing for an instructor and, I found, more demanding in its own way. I also developed a much stronger intuition for model evaluation: I now think about train/test splits, validation sets, and out-of-sample performance as a natural first step in any modeling task, not as an afterthought.

The challenges I did not fully anticipate were the conceptual leaps between methods. Moving from linear regression to multinomial logistic regression, and then to count models and discriminant analysis, required more active adjustment than I expected - not just learning new syntax, but rebuilding my intuition about what a model is doing and how to read its output. I overcame this primarily through the assignments themselves, through discussions with classmates when I was stuck, and through returning to the course materials and supplementary resources when something did not click the first time.

Looking ahead, I intend to build on the foundation this course has given me by continuing to work with Tidymodels in my remaining coursework and exploring regularization methods such as ridge and lasso regression, which came up at the edges of this course but which I want to develop more fully. The habits of assumption-checking, honest model evaluation, and translating statistical output into plain language are ones I will carry into every analysis I do going forward - in the Statistical Modeling II course and in my career.

Thank you for a semester that genuinely challenged and expanded how I think about statistical modeling.

Sincerely,
Udita