STA 631 Final Portfolio

Vallapuneni Lakshmanarao

Course: STA 631 — Statistical Modeling

Load necessary libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.9     ✔ rsample      1.3.1
## ✔ dials        1.4.2     ✔ tailor       0.1.0
## ✔ infer        1.0.9     ✔ tune         2.0.0
## ✔ modeldata    1.5.1     ✔ workflows    1.3.0
## ✔ parsnip      1.3.3     ✔ workflowsets 1.1.1
## ✔ recipes      1.3.1     ✔ yardstick    1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

Load dataset

The Customer Shopping Behavior dataset contains 3,900 records of individual customers across 18 columns, capturing demographic, transactional, and behavioral information. Key variables include Age, Gender, Item Purchased, Category, Purchase Amount (USD), Previous Purchases, Review Rating, Season, Subscription Status, Shipping Type, Discount Applied, Promo Code Used, and Payment Method. The data can be used to analyze purchasing patterns, spending behavior, and subscription effects. It is suitable for exploratory data analysis, regression, classification, and predictive modeling, which can help businesses identify trends, segment customers, and inform marketing, promotion, and retention strategies.

customer_data <- read_csv("customer_shopping_behavior.csv")

## Rows: 3900 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Gender, Item Purchased, Category, Location, Size, Color, Season, S...
## dbl  (5): Customer ID, Age, Purchase Amount (USD), Review Rating, Previous P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View summary and structure

glimpse(customer_data)

## Rows: 3,900
## Columns: 18
## $ `Customer ID`            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Age                      <dbl> 55, 19, 50, 21, 45, 46, 63, 27, 26, 57, 53, 3…
## $ Gender                   <chr> "Male", "Male", "Male", "Male", "Male", "Male…
## $ `Item Purchased`         <chr> "Blouse", "Sweater", "Jeans", "Sandals", "Blo…
## $ Category                 <chr> "Clothing", "Clothing", "Clothing", "Footwear…
## $ `Purchase Amount (USD)`  <dbl> 53, 64, 73, 90, 49, 20, 85, 34, 97, 31, 34, 6…
## $ Location                 <chr> "Kentucky", "Maine", "Massachusetts", "Rhode …
## $ Size                     <chr> "L", "L", "S", "M", "M", "M", "M", "L", "L", …
## $ Color                    <chr> "Gray", "Maroon", "Maroon", "Maroon", "Turquo…
## $ Season                   <chr> "Winter", "Winter", "Spring", "Spring", "Spri…
## $ `Review Rating`          <dbl> 3.1, 3.1, 3.1, 3.5, 2.7, 2.9, 3.2, 3.2, 2.6, …
## $ `Subscription Status`    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Shipping Type`          <chr> "Express", "Express", "Free Shipping", "Next …
## $ `Discount Applied`       <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Promo Code Used`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Previous Purchases`     <dbl> 14, 2, 23, 49, 31, 14, 49, 19, 8, 4, 26, 10, …
## $ `Payment Method`         <chr> "Venmo", "Cash", "Credit Card", "PayPal", "Pa…
## $ `Frequency of Purchases` <chr> "Fortnightly", "Fortnightly", "Weekly", "Week…

summary(customer_data)

##   Customer ID          Age           Gender          Item Purchased    
##  Min.   :   1.0   Min.   :18.00   Length:3900        Length:3900       
##  1st Qu.: 975.8   1st Qu.:31.00   Class :character   Class :character  
##  Median :1950.5   Median :44.00   Mode  :character   Mode  :character  
##  Mean   :1950.5   Mean   :44.07                                        
##  3rd Qu.:2925.2   3rd Qu.:57.00                                        
##  Max.   :3900.0   Max.   :70.00                                        
##                                                                        
##    Category         Purchase Amount (USD)   Location             Size          
##  Length:3900        Min.   : 20.00        Length:3900        Length:3900       
##  Class :character   1st Qu.: 39.00        Class :character   Class :character  
##  Mode  :character   Median : 60.00        Mode  :character   Mode  :character  
##                     Mean   : 59.76                                             
##                     3rd Qu.: 81.00                                             
##                     Max.   :100.00                                             
##                                                                                
##     Color              Season          Review Rating  Subscription Status
##  Length:3900        Length:3900        Min.   :2.50   Length:3900        
##  Class :character   Class :character   1st Qu.:3.10   Class :character   
##  Mode  :character   Mode  :character   Median :3.80   Mode  :character   
##                                        Mean   :3.75                      
##                                        3rd Qu.:4.40                      
##                                        Max.   :5.00                      
##                                        NA's   :37                        
##  Shipping Type      Discount Applied   Promo Code Used    Previous Purchases
##  Length:3900        Length:3900        Length:3900        Min.   : 1.00     
##  Class :character   Class :character   Class :character   1st Qu.:13.00     
##  Mode  :character   Mode  :character   Mode  :character   Median :25.00     
##                                                           Mean   :25.35     
##                                                           3rd Qu.:38.00     
##                                                           Max.   :50.00     
##                                                                             
##  Payment Method     Frequency of Purchases
##  Length:3900        Length:3900           
##  Class :character   Class :character      
##  Mode  :character   Mode  :character      
##                                           
##                                           
##                                           
##

Data Cleaning: Handle missing values

The summary above shows that Review Rating is the only numeric column with missing values (37 NAs). I replaced missing numeric values with the column median and missing character values with "Unknown". Because the medians are computed on the full dataset before the train/test split, the imputed Review Rating values use information from rows that later fall in the test set.

customer_data <- customer_data %>%
  mutate(across(where(is.numeric), ~replace_na(., median(., na.rm = TRUE)))) %>%
  mutate(across(where(is.character), ~replace_na(., "Unknown")))
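
As a small illustration of the imputation step, the sketch below counts missing values per column and then applies the same median and "Unknown" replacement. The `toy` tibble and its column names are made up for illustration and are not part of the customer dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the customer dataset
toy <- tibble(
  rating = c(3.1, NA, 4.4, 2.7),
  season = c("Winter", NA, "Spring", "Fall")
)

# Count missing values in each column before imputing
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Numeric NAs -> column median; character NAs -> "Unknown"
imputed <- toy %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE)))) %>%
  mutate(across(where(is.character), ~ replace_na(.x, "Unknown")))

na_counts
imputed
```

Checking the counts first makes it clear how many values the imputation actually changes.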

Convert categorical variables to factors

I converted categorical variables such as Gender, Season, and Subscription Status into factors to ensure that R treats them as categorical data rather than numeric or character types. This step is essential for modeling because many statistical methods, including regression and classification, require categorical predictors to be properly encoded. Converting to factors allows tidymodels and other modeling functions to automatically handle these variables correctly, for example by creating dummy variables when needed. This preprocessing step ensures that the models interpret group differences accurately, supports meaningful coefficient estimates, and prevents errors during model fitting. It aligns with the course objective of applying proper data preparation techniques to support reliable statistical analysis.

customer_data <- customer_data %>%
  mutate(
    Gender = factor(Gender),
    Season = factor(Season),
    `Subscription Status` = factor(`Subscription Status`)
  )
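
To make the dummy-variable idea concrete, here is a minimal sketch using base R's `model.matrix()`. The `toy` data frame is hypothetical; only the level names are borrowed from the Season column.

```r
# Hypothetical toy data with one four-level factor
toy <- data.frame(
  Season = factor(c("Fall", "Spring", "Summer", "Winter"))
)

# model.matrix() applies R's default treatment contrasts:
# the first level ("Fall") becomes the reference and gets no column
dummies <- model.matrix(~ Season, data = toy)

levels(toy$Season)
colnames(dummies)
```

Each remaining column is an indicator for one level, so a coefficient on `SeasonSpring` is read as a difference from the reference level.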

Exploratory Visualization

library(ggplot2)

Histogram of Purchase Amount

ggplot(customer_data, aes(x = `Purchase Amount (USD)`)) +
  geom_histogram(fill = "steelblue", bins = 30, color = "black") +
  theme_minimal() +
  labs(
    title = "Distribution of Purchase Amount",
    x = "Purchase Amount (USD)",
    y = "Count"
  )

### Conclusion: The histogram shows the distribution of Purchase Amount (USD) across the 3,900 transactions. The summary statistics above do not support a skewed shape: values range from 20 to 100, the mean (59.76) and median (60) nearly coincide, and the quartiles (39 and 81) sit almost symmetrically around the median. This points to a roughly symmetric, fairly even spread of purchase amounts across the range, not a concentration of small purchases with a few large ones.
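
One way to check a claim about distribution shape numerically is to compare the mean with the median and compute a moment-based skewness coefficient. The sketch below does this in base R. The `skewness()` helper and both example vectors are hypothetical and only illustrate the contrast between a symmetric and a right-skewed sample.

```r
# Moment-based sample skewness (hypothetical helper, base R only)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Evenly spread values between 20 and 100: symmetric
symmetric <- seq(20, 100, by = 1)

# Many small values and a few large ones: right-skewed
right_skewed <- c(rep(20, 50), rep(30, 30), rep(50, 15), rep(100, 5))

c(mean = mean(symmetric), median = median(symmetric), skew = skewness(symmetric))
c(mean = mean(right_skewed), median = median(right_skewed), skew = skewness(right_skewed))
```

A skewness near zero with mean close to median suggests symmetry, while a clearly positive value with mean above median suggests a right tail.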

Boxplot of Purchase Amount by Subscription Status

ggplot(customer_data, aes(
  x = `Subscription Status`,
  y = `Purchase Amount (USD)`,
  fill = `Subscription Status`
)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Purchase Amount by Subscription Status",
    x = "Subscription Status",
    y = "Purchase Amount (USD)"
  )

### Outcome: The boxplot compares the distribution of Purchase Amount (USD) for subscribed and non-subscribed customers. Any visual difference between the two groups should be read cautiously. The regression under Learning Objective 1, which includes Subscription Status as a predictor, explains almost none of the variation in purchase amount (test R-squared of 0.00208), so subscription status appears to be at most weakly associated with purchase amount in this dataset.

Learning Objective 1: Multiple Linear Regression

Tell

I built a linear regression model using tidymodels to predict Purchase Amount (USD) from customer demographics and behaviors, converting categorical variables to dummies and splitting the data into training (80%) and test (20%) sets. The model did not capture meaningful variation in spending: on the test set the RMSE was 24.0 and R-squared was 0.00208, so the predictors explain about 0.2% of the variance in purchase amount and the predictions are barely better than always predicting the mean. Through this process, I learned to create a complete modeling workflow, handle categorical variables, and evaluate results on held-out data, which reinforced the importance of preprocessing, validation, and reporting weak results honestly. This exercise demonstrates skills in predictive modeling, statistical reasoning, and communicating analytical findings, in line with the STA 631 course objectives.

library(tidymodels)

# Split data
set.seed(123)
data_split <- initial_split(customer_data, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

# Recipe
rec <- recipe(`Purchase Amount (USD)` ~ Age + `Previous Purchases` + `Review Rating` +
                Gender + Season + `Subscription Status` + Category + `Payment Method`,
              data = train_data) %>%
  step_dummy(all_nominal_predictors())  # Convert categorical variables to dummies

# Linear Regression Model
lm_model <- linear_reg() %>%
  set_engine("lm")

# Workflow
lm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model)

# Fit model
lm_fit <- fit(lm_wf, data = train_data)

# Predict on test data
lm_pred <- predict(lm_fit, test_data) %>%
  bind_cols(test_data)

# Evaluate model
rmse(lm_pred, truth = `Purchase Amount (USD)`, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        24.0

rsq(lm_pred, truth = `Purchase Amount (USD)`, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard     0.00208

Show

The multiple linear regression model predicts Purchase Amount (USD) using numeric and categorical predictors such as Age, Previous Purchases, Review Rating, Gender, Season, Subscription Status, Category, and Payment Method. The output above reports the test-set performance metrics, RMSE (24.0) and R² (0.00208). Coefficient estimates and residual diagnostics are not printed in this section, so the output shown here does not establish how individual factors affect customer spending.
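
To put an RMSE of 24.0 in context, it helps to compare a model against the simplest baseline, which always predicts the training mean. The sketch below uses simulated data (an assumption, not the customer dataset) in which the outcome is uniform on 20 to 100 and unrelated to the predictor. The `rmse_vec()` helper is hypothetical and written in base R.

```r
set.seed(631)

# Hypothetical data: outcome is unrelated to the predictor
n <- 500
toy <- data.frame(
  age    = sample(18:70, n, replace = TRUE),
  amount = runif(n, min = 20, max = 100)
)
train <- toy[1:400, ]
test  <- toy[401:500, ]

rmse_vec <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Baseline: always predict the training mean
baseline_rmse <- rmse_vec(test$amount, mean(train$amount))

# Linear model with an uninformative predictor
fit <- lm(amount ~ age, data = train)
model_rmse <- rmse_vec(test$amount, predict(fit, newdata = test))

# Theoretical standard deviation of a uniform variable on [20, 100]
uniform_sd <- (100 - 20) / sqrt(12)

c(baseline = baseline_rmse, model = model_rmse, uniform_sd = uniform_sd)
```

When the model RMSE is about the same as the baseline RMSE, the predictors add essentially nothing, which is what an R² close to zero also indicates.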

Learning Objective 2: Multinomial Logistic Regression

Tell

I trained a multinomial logistic regression model to predict customer Satisfaction Level (Low, Medium, High), derived from Review Rating, using Age, Purchase Amount, Previous Purchases, Gender, Season, and Subscription Status as predictors. The dataset was split into training (80%) and testing (20%) subsets. The model did not learn to separate the classes: it predicted "Medium" for all 780 test observations, so its accuracy (0.5872) equals the no-information rate and Kappa is 0. Sensitivity is 1 for "Medium" and 0 for "Low" and "High". Part of the problem is the class definition. The minimum Review Rating is 2.5, so the cutoff `Review Rating` <= 2.5 assigns only ratings of exactly 2.5 to "Low", which makes that class about 2% of the test data. "High" is not rare (about 39% of the test data), so its zero sensitivity points to predictors that carry little information about ratings, not to class imbalance alone. This exercise reinforced my understanding of multi-class classification, in particular the need to compare accuracy against the no-information rate and to read per-class sensitivity and specificity instead of relying on accuracy alone. I also gained experience applying generalized linear models to categorical outcomes. This connects to the STA 631 course objectives of classification modeling, careful evaluation of predictive performance, and honest interpretation of model results.

library(nnet)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity

## The following object is masked from 'package:rsample':
## 
##     calibration

## The following object is masked from 'package:purrr':
## 
##     lift

customer_data <- customer_data %>%
  mutate(
    SatisfactionLevel = case_when(
      `Review Rating` <= 2.5 ~ "Low",
      `Review Rating` <= 4 ~ "Medium",
      TRUE ~ "High"
    ),
    SatisfactionLevel = factor(SatisfactionLevel, levels = c("Low", "Medium", "High"))
  )

# Split data
set.seed(123)
data_split <- initial_split(customer_data, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

Fit multinomial logistic regression

multi_model <- multinom(SatisfactionLevel ~ Age + `Purchase Amount (USD)` + `Previous Purchases` + Gender + Season + `Subscription Status`,
                        data = train_data)

## # weights:  30 (18 variable)
## initial  value 3427.670341 
## iter  10 value 2783.203210
## iter  20 value 2353.147199
## iter  30 value 2279.685225
## final  value 2279.684919 
## converged

# Predict
multi_pred <- predict(multi_model, test_data)
confusionMatrix(multi_pred, test_data$SatisfactionLevel)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low      0      0    0
##     Medium  15    458  307
##     High     0      0    0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5872         
##                  95% CI : (0.5517, 0.622)
##     No Information Rate : 0.5872         
##     P-Value [Acc > NIR] : 0.5153         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity             0.00000        1.0000      0.0000
## Specificity             1.00000        0.0000      1.0000
## Pos Pred Value              NaN        0.5872         NaN
## Neg Pred Value          0.98077           NaN      0.6064
## Prevalence              0.01923        0.5872      0.3936
## Detection Rate          0.00000        0.5872      0.0000
## Detection Prevalence    0.00000        1.0000      0.0000
## Balanced Accuracy       0.50000        0.5000      0.5000

Show

The multinomial logistic regression model predicts Satisfaction Level (Low, Medium, High) from Age, Purchase Amount, Previous Purchases, Gender, Season, and Subscription Status. The output above shows the optimizer's convergence trace and a confusion matrix with per-class statistics for the test data. Predicted class probabilities are not printed.
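
The headline statistics can be reproduced by hand from the confusion matrix printed above. The sketch below re-enters those counts in base R and computes accuracy, the no-information rate, and Cohen's kappa. The variable names are my own. The counts are copied from the output.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(
  c( 0,   0,   0,
    15, 458, 307,
     0,   0,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Prediction = c("Low", "Medium", "High"),
    Reference  = c("Low", "Medium", "High")
  )
)

n <- sum(cm)

# Accuracy: share of test rows on the diagonal
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy from always predicting the largest class
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the marginals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = kappa), 4)
```

Accuracy equal to the no-information rate and a kappa of zero both follow from every test row being assigned to the "Medium" class.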

Learning Objective 3: Poisson Regression

Tell

I applied Poisson regression to model Previous Purchases as a count variable, using predictors such as Age, Purchase Amount, Gender, Season, and Subscription Status. The model was trained on 80% of the data and tested on the remaining 20%, and I evaluated its performance using RMSE while also checking for overdispersion. Several coefficients are statistically significant, but the effects are small and the residual deviance (29575) is only slightly below the null deviance (29727), so the predictors explain little of the variation in purchase counts. The test RMSE was 14.32. The Pearson dispersion statistic of 8.30 is far above 1, which indicates substantial overdispersion, so the Poisson standard errors and p-values are likely too optimistic. Through this exercise, I learned how to handle count data, assess model assumptions, and interpret coefficients in a generalized linear model framework. This demonstrates the STA 631 course objectives of selecting appropriate statistical models, evaluating assumptions, and connecting model outputs to meaningful business insights.

# Poisson regression to model the count variable `Previous Purchases`
# Fit Poisson model
poisson_model <- glm(`Previous Purchases` ~ Age + `Purchase Amount (USD)` + Gender + Season + `Subscription Status`,
                     family = poisson(link = "log"), data = train_data)

summary(poisson_model)

## 
## Call:
## glm(formula = `Previous Purchases` ~ Age + `Purchase Amount (USD)` + 
##     Gender + Season + `Subscription Status`, family = poisson(link = "log"), 
##     data = train_data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              3.0738424  0.0166662 184.435  < 2e-16 ***
## Age                      0.0019311  0.0002332   8.281  < 2e-16 ***
## `Purchase Amount (USD)`  0.0002668  0.0001511   1.765 0.077550 .  
## GenderMale               0.0435327  0.0084659   5.142 2.72e-07 ***
## SeasonSpring             0.0239893  0.0100799   2.380 0.017316 *  
## SeasonSummer             0.0063085  0.0102296   0.617 0.537435    
## SeasonWinter             0.0332896  0.0101247   3.288 0.001009 ** 
## `Subscription Status`Yes 0.0310076  0.0087514   3.543 0.000395 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 29727  on 3119  degrees of freedom
## Residual deviance: 29575  on 3112  degrees of freedom
## AIC: 44583
## 
## Number of Fisher Scoring iterations: 5

# Predict on test data
poisson_pred <- predict(poisson_model, test_data, type = "response")

# Evaluate RMSE
rmse_val <- sqrt(mean((poisson_pred - test_data$`Previous Purchases`)^2))
rmse_val

## [1] 14.3214

# Check overdispersion: a ratio well above 1 indicates overdispersion
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) / poisson_model$df.residual
dispersion

## [1] 8.295074

Show

Through this, I learned to handle count data using generalized linear models. Applying the log link function and checking assumptions deepened my understanding of the exponential family of distributions. The class highlighted the importance of selecting models appropriate to the data type and not simply defaulting to linear regression.
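
The overdispersion check can be illustrated on simulated data. In the sketch below, one outcome is drawn from a true Poisson model and the other is spread evenly over 1 to 50. Both data sets and the `pearson_dispersion()` helper are hypothetical and only show how the statistic behaves.

```r
set.seed(631)
n <- 2000
x <- rnorm(n)

# Counts that really follow a Poisson model
y_pois <- rpois(n, lambda = exp(3 + 0.1 * x))

# Counts spread evenly over 1..50, unrelated to x
y_unif <- sample(1:50, n, replace = TRUE)

# Pearson chi-square divided by residual degrees of freedom
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

d_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson(link = "log")))
d_unif <- pearson_dispersion(glm(y_unif ~ x, family = poisson(link = "log")))

c(poisson_data = d_pois, uniform_data = d_unif)
```

For true Poisson data the statistic stays close to 1. For counts spread evenly over 1 to 50 the variance (about 208) is roughly eight times the mean (25.5), so the statistic is far above 1. One common response to a large value is to refit with `family = quasipoisson`, which scales the standard errors by the estimated dispersion.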

Learning Objective 4: Lasso Regression

Tell

I implemented a Lasso regression model to predict Purchase Amount (USD) using a combination of numeric and categorical predictors, including Age, Previous Purchases, Review Rating, Gender, Season, Subscription Status, Category, and Payment Method. Categorical variables were converted to dummy variables with model.matrix() to make them compatible with the glmnet package. I used cross-validation to select the penalty parameter (lambda.min = 0.306), which shrinks less important coefficients toward zero and can reduce overfitting. Predictions were made on the test dataset, and the test RMSE was 24.02, essentially the same as the unregularized linear model (24.0), so regularization did not improve predictive accuracy here. This is consistent with predictors that carry little signal about purchase amount. This exercise helped me understand the role of regularization in model generalization, especially when dealing with many predictors. I learned to balance model complexity with interpretability and to integrate cross-validation into the modeling workflow. This aligns with STA 631 objectives of advanced regression techniques, model validation, and interpreting regularized models.

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-8

# Prepare matrix for glmnet
x <- model.matrix(`Purchase Amount (USD)` ~ Age + `Previous Purchases` + `Review Rating` +
                    Gender + Season + `Subscription Status` + Category + `Payment Method`,
                  data = train_data)[,-1]  # Remove intercept
y <- train_data$`Purchase Amount (USD)`

# Fit a lasso regression (alpha = 1)
lasso_model <- cv.glmnet(x, y, alpha = 1)

# Best lambda
lasso_model$lambda.min

## [1] 0.3056318

# Predict on test data
x_test <- model.matrix(`Purchase Amount (USD)` ~ Age + `Previous Purchases` + `Review Rating` +
                         Gender + Season + `Subscription Status` + Category + `Payment Method`,
                       data = test_data)[,-1]
pred <- predict(lasso_model, s = "lambda.min", newx = x_test)

# Evaluate RMSE
rmse_val <- sqrt(mean((pred - test_data$`Purchase Amount (USD)`)^2))
rmse_val

## [1] 24.02011

Show

I learned that regularization is a powerful tool to prevent overfitting while identifying the most important predictors. The cross-validation process emphasized how tuning penalty parameters affects the bias-variance tradeoff. Class exercises reinforced thinking critically about model complexity versus predictive performance.
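
The shrinkage behavior of the lasso can be seen in a special case. When predictors are orthonormal and the objective is half the residual sum of squares plus lambda times the L1 norm, the lasso solution is the soft-thresholded least-squares coefficient. The sketch below implements that rule in base R. The coefficient values and names are hypothetical, and this is not how glmnet computes its fit in general.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# and set it to exactly zero if its magnitude is below lambda
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

# Hypothetical least-squares coefficients
beta_ols <- c(age = -2.0, rating = 0.2, purchases = 1.0)

beta_lasso <- soft_threshold(beta_ols, lambda = 0.3)
beta_lasso
```

Large coefficients are pulled toward zero by the penalty, and small ones are removed entirely, which is why the lasso can act as a variable selection method.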

Learning Objective 5: Polynomial Regression

Tell

I implemented a polynomial regression model to predict Purchase Amount (USD) by incorporating second-degree polynomial terms for Age, Previous Purchases, and Review Rating, alongside categorical predictors such as Gender, Season, and Subscription Status. Using tidymodels, I created a recipe that included polynomial transformations and dummy coding for categorical variables, and then built a linear regression workflow. The model was trained on 80% of the data and tested on the remaining 20%, with performance evaluated using RMSE. Polynomial terms allow a model to capture nonlinear relationships between predictors and the outcome, but here the test RMSE was 24.0, the same as the standard linear regression, so the added terms did not improve predictions. This suggests there is little curvature in these relationships for the model to exploit. From this exercise, I learned to balance model complexity with interpretability and to check whether added flexibility actually improves performance on held-out data. This connects to STA 631 objectives in advanced regression techniques, model diagnostics, and effective communication of statistical insights.

library(tidymodels)

# Recipe with polynomial terms
poly_rec <- recipe(`Purchase Amount (USD)` ~ Age + `Previous Purchases` + `Review Rating` + Gender + Season + `Subscription Status`,
                   data = train_data) %>%
  step_poly(Age, degree = 2) %>%
  step_poly(`Previous Purchases`, degree = 2) %>%
  step_poly(`Review Rating`, degree = 2) %>%
  step_dummy(all_nominal_predictors())

# Linear regression model
poly_model <- linear_reg() %>%
  set_engine("lm")

# Workflow
poly_wf <- workflow() %>%
  add_recipe(poly_rec) %>%
  add_model(poly_model)

# Fit model
poly_fit <- fit(poly_wf, data = train_data)

# Predict on test data
poly_pred <- predict(poly_fit, test_data) %>%
  bind_cols(test_data)

# Evaluate RMSE
rmse(poly_pred, truth = `Purchase Amount (USD)`, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        24.0

Show

This objective tested whether purchase amount follows nonlinear patterns. Polynomial terms let a model capture increases and decreases in spending across Age, Previous Purchases, or Review Rating, but in this dataset they did not improve test RMSE over the linear model. I learned to balance model complexity with interpretability and to use visual diagnostics to check whether added terms are meaningful.
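
A polynomial term only helps when the relationship is actually curved, and comparing test RMSE with and without the term is one way to check. The sketch below does this on simulated data with a built-in quadratic pattern. The data are hypothetical, and the code uses base R's `lm()` and `poly()` in place of the tidymodels recipe.

```r
set.seed(631)

# Hypothetical data with a true quadratic relationship
x <- seq(18, 70, length.out = 300)
y <- 60 - 0.05 * (x - 44)^2 + rnorm(300, sd = 5)
toy <- data.frame(x = x, y = y)

# Alternate rows into training and test sets
train <- toy[seq(1, 300, by = 2), ]
test  <- toy[seq(2, 300, by = 2), ]

rmse_vec <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# Straight-line fit versus second-degree polynomial fit
linear_fit <- lm(y ~ x, data = train)
quad_fit   <- lm(y ~ poly(x, 2), data = train)

linear_rmse <- rmse_vec(test$y, predict(linear_fit, newdata = test))
quad_rmse   <- rmse_vec(test$y, predict(quad_fit, newdata = test))

c(linear = linear_rmse, quadratic = quad_rmse)
```

When curvature is present, the quadratic fit has a clearly lower test RMSE. When the two values are nearly equal, as with the 24.0 reported for both models in this portfolio, the extra terms are not capturing real structure.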

Participation Reflection

Participation in this course was an active and meaningful part of my learning experience, and I engaged with the class community in ways that supported both my own growth and the learning of others. From the beginning of the semester, I made a conscious effort to participate beyond attendance by asking questions, contributing ideas during discussions, and collaborating with peers during in-class modeling activities. These interactions helped me deepen my understanding of statistical modeling concepts and also strengthened my confidence in communicating technical information.

One of the most valuable aspects of my participation came from working through tidymodels code with classmates. Many modeling tasks—such as fitting multinomial logistic regression or building regularized models—required careful attention to syntax, validation steps, and correct interpretation of results. During group exercises, I often helped peers troubleshoot recipe steps, fix errors in workflows, or understand how validation folds were applied. Explaining these concepts not only supported others but also reinforced my own learning. I realized that teaching or clarifying an idea is one of the best ways to internalize it.

During class discussions, I made an effort to share observations about model behavior, such as patterns I noticed in residual plots or ways regularization affected coefficient shrinkage. These conversations helped build a collaborative environment where we could learn from each other’s perspectives. I also appreciated moments when others shared their insights, which sometimes shifted my understanding or encouraged me to rethink my modeling choices. Participation became a two-way learning process—contributing and receiving knowledge from classmates.

In ethical and transparency-focused discussions, I also engaged by sharing thoughts on model fairness, interpretability, and responsible reporting. These conversations reminded me that statistical modeling is not only technical but also human-centered. Actively participating allowed me to connect the technical skills we learned with the broader responsibilities of data analysts.

Beyond class time, I participated by interacting with classmates during office hours, group practice sessions, and online discussions. When peers asked questions about model interpretation or coding structure, I tried to be supportive and collaborative. These small interactions contributed to a positive and productive learning environment.

Overall, my participation in this course was consistent, intentional, and growth-oriented. I engaged actively in discussions, collaborated on problem-solving, and contributed to a strong learning community. This engagement not only helped me succeed academically but also strengthened the interpersonal and communication skills that are essential in statistical and data science work.

Self-Evaluation Letter

Throughout this semester, this course pushed me to think differently about statistical modeling, communication, and my own learning habits. When I entered the class, I expected to focus mostly on coding and getting results. Instead, I learned that modeling is just as much about interpretation, reasoning, clarity, and ethical thinking as it is about running functions in R. This class made me reflect not only on what I could do technically, but also on how I learn, how I communicate, and how I approach problem-solving. This letter summarizes my progress, challenges, and areas where I still want to grow.

At the beginning of the semester, my goals were simple: improve my R skills, build confidence with models, and understand which methods are appropriate for different questions. These goals stayed consistent but gradually deepened as the class went on. Early on, I realized that the course design required me to think far beyond “run the code.” Each assignment demanded explanation, justification, clarity, and thoughtful writing. This forced me to re-evaluate what it means to “understand” a model. It is not enough to produce a coefficient; I needed to explain what it meant, why it mattered, and what assumptions supported it.

The first major shift in my understanding came with the early multiple regression assignments. I had used regression before, but this course required me to explain my model choices, interpret categorical predictors, and evaluate assumptions like linearity, constant variance, and normality of residuals. I began to appreciate that diagnostics are not extra—they are central to responsible modeling. Every plot or table told a story about what the model could or could not say. Learning to interpret these elements helped me see data more critically.

As the semester progressed, the course introduced deeper modeling concepts: multinomial logistic regression, linear discriminant analysis, Poisson regression, ridge and lasso, and polynomial regression. What stood out to me was the consistent expectation that I explain why each model was appropriate and how its results should be interpreted. These assignments strengthened my ability to connect statistical tools to real-world questions. They also helped me understand overfitting, complexity, and the importance of validation. Before this course, I would run models without thinking much about tuning or generalization. Now I know that evaluating a model on new data is essential before drawing conclusions.

The emphasis on writing was one of the most challenging but meaningful parts of the course. In the beginning, I struggled to describe model results clearly. I often wrote sentences that were technically correct but unclear or incomplete. Through repeated assignments, feedback, and revisions, I learned to slow down and write in a structured way: introduce the model, explain the predictors, interpret coefficients, acknowledge uncertainty, and provide a clear conclusion. This practice improved not only my writing but also my understanding. I realized that if I could not explain a model clearly, I probably did not fully understand it. Writing became a tool for learning, not just a final step.

Participation in this course also shaped my growth. I made an effort to contribute during discussions, help classmates interpret outputs, and share reasoning when we worked in groups. Talking through models with others helped me identify misunderstandings and correct them. Group work also helped me see how other people approached problems differently. This broadened my thinking and reminded me that statistics is often collaborative. The classroom environment made it comfortable to ask questions, share ideas, and support one another in understanding difficult concepts. I believe I contributed positively to the learning community by engaging, offering help, and being open to discussion.

Some challenges surprised me. I expected to struggle with coding errors, and I did—but I also learned how to troubleshoot more effectively. Reading error messages, checking data types, and breaking down code step-by-step became easier over time. A challenge I did not expect was how much work it takes to interpret categorical variables correctly, especially when thinking about reference groups, contrasts, and the meaning behind coefficients. Another unexpected challenge was writing about uncertainty. I had to learn how to describe limitations without undermining the entire analysis. These challenges often required me to re-read notes, revisit examples, and practice repeatedly. Eventually, I became more confident in explaining results in a way that was statistically accurate and understandable.

Another area where I grew was in recognizing the ethical and communicative responsibilities of statistical modeling. Class discussions highlighted how models can mislead if interpreted carelessly or presented without context. I learned to think more critically about the impact of analysis and the importance of transparency. This understanding will guide me not just in academic work but in future professional roles.

Despite my progress, certain areas remain challenging. I want to improve my intuition for interactions, nonlinear patterns, and the deeper reasoning behind regularization methods. I also want to continue improving my statistical writing so my explanations remain clear, concise, and grounded in evidence. These are all areas I am motivated to keep practicing beyond the course.

Overall, when I look back at my Letter of Commitment from the beginning of the semester, I believe I met my goals and even surpassed some of them. I have become a more intentional learner—someone who thinks carefully about assumptions, communicates results clearly, checks models thoroughly, and reflects on uncertainty. I also gained confidence in my ability to troubleshoot R errors, evaluate multiple models, and choose appropriate tools for real-world questions. Most importantly, I built habits that I can carry into future classes, research, and professional work.

This course challenged me, pushed me, and ultimately helped me grow in ways I did not anticipate. I feel proud of the progress I made and confident in my ability to keep building on this foundation. I am leaving the course with stronger skills, clearer thinking, and a deeper appreciation for both the technical and communicative sides of statistical modeling.

Grade Justification Statement

Based on my progress, effort, and mastery of the course objectives, I believe I have earned an A in this class. I consistently engaged in assignments, applied feedback, participated actively in our learning community, and demonstrated strong growth in modeling, interpretation, and communication. My work shows clear understanding of tidymodels, validation, and responsible analysis. I met all expectations and often exceeded them, so I do not feel there is justification for receiving a grade lower than an A.