---
title: "Projects and Presentations: <img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Check out the links below to see what I have been working on!"
author:
- name: Natalie LePera
  affiliation: West Chester University | M.S. Applied Statistics, Data Science Concentration
date: "Last Update: 07 Apr 2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: yes
      smooth_scroll: true
    number_sections: no
    code_folding: hide
    code_download: yes
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h1.subtitle {  /* Subtitle - font specifications of the report subtitle */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
    font-family: verdana;
}

body {
  background-color:white;
  font-family: verdana;
    }

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
    font-family: verdana;
}

h5 {
  color: navy;
  font-family: verdana;
}

.iframe {
  text-align: center;
}

a:link {
  color: purple ;
  font-family: verdana;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18px;
    font-family: verdana;
}

.td1 {
  font-weight: bold;
  font-family: verdana;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<script src="https://elfsightcdn.com/platform.js" async></script>
<div class="elfsight-app-5fc8465d-8b62-465e-934e-71d6d7426965" data-elfsight-app-lazy></div>
  


## Machine Learning, Modeling, & Cross Validation 

#### Unsupervised Learning Algorithms and Predictive Modeling

<font class = "td1">Big Back Behavior</font> | <a href="https://rpubs.com/nlepera/sta552_HW05" target = "blank">[Report]</a>
<details><summary>Predicting obesity rankings based on behavioral survey results [click for more details]</summary><p>

  - Real-world data. Medium data set (approx. 2,200 observations)
  - CART regression and classification prediction models; bootstrap aggregating (bagging) regression and classification prediction models
  - Variable importance analysis, hyperparameter tuning, model pruning, ROC curve analysis & optimal cutoff identification
  - Analysis goals:
    - Develop overweight/obesity prediction model utilizing patient lifestyle and behavioral information, <b>excluding</b> weight
    - Develop weight-independent prediction models for overweight/obesity to allow for prediction of those at risk based on behavioral & lifestyle choices
</p></details>
<br>

<font class = "td1">Can you pay me back?</font> | <a href="https://rpubs.com/nlepera/sta551_hw04" target ="blank">[Report]</a> & <a href="https://nlepera.github.io/sta551/FINAL/" target ="blank">[Presentation]</a>
<details><summary>Loan Default Borrower Data Clustering & Analysis [click for more details]</summary><p>

  - Report & supporting presentation
  - Synthetic data. Medium data set (approx. 1,000 observations)
  - K-means clustering, agglomerative hierarchical clustering, principal component analysis, local outlier factor analysis
</p></details>
<br>


<font class = "td1">Blind Breast Biopsies</font> | <a href="https://rpubs.com/nlepera/sta551_hw03" target ="blank">[Report]</a>
<details><summary>Predicting Breast Tissue Biopsy Diagnosis In the Event of Incomplete Biopsy [click for more details]</summary><p>

  - Synthetic data. Small data set (approx. 600 observations)
  - Logistic predictive modeling, single-layer neural network model (perceptron) prediction, decision tree algorithms, ROC analysis, and bagging
</p></details>
<br>

#### Linear & Logistic Regression Modeling Using Machine Learning Approaches

<font class = "td1">A Blind Taste Test</font> | <a href="https://rpubs.com/nlepera/sta552_HW04" target = "blank">[Report]</a>
<details><summary>Predicting sommelier quality rankings of wine and wine type (red/white) utilizing chemical composition alone [click for more details]</summary><p>

  - Real-world data. Medium data set (approx. 6,500 observations)
  - Regularized linear & logistic regression prediction models, support vector machine (SVM) regression and classification models
  - Coefficient path analysis, hyperparameter tuning, and optimal cutoff probability analysis for optimized accuracy and specificity
  - Analysis goals:
      - Identifying the best possible prediction model for sommelier wine quality ranking based on chemical composition
      - Creating a white vs. red prediction model for blended wines that do not neatly fall into either category
</p></details>
<br>

<font class = "td1">Hold That Plane!</font> | <a href="https://rpubs.com/nlepera/sta552_HW03" target ="blank">[Report]</a>
<details><summary>An analysis of global flight delay data and prediction model creation to reduce delays and increase live prediction capabilities [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Multiple Imputation by Chained Equations (MICE), Rubin's rules, recursive feature elimination, principal component analysis, k-means clustering
  - Forward elimination model creation, cross validation, Receiver Operating Characteristic (ROC) comparison
  - Analysis goals:
      - Identifying the variables with greatest impact on overall flight delay for subsequent mitigation and preventative efforts
      - Live flight delay prediction based on available flight data prior to landing (predicting total delay time before flights depart)
</p>
</details>
<br>


## Regression

#### Regression Analysis & Model Creation

<font class = "td1">Can We Predict the Future in Health?</font> | <a href="https://rpubs.com/nlepera/sta551_hw02" target ="blank">[Report]</a>
<details><summary>Framingham Heart Study Model Fitting and Cross Validation for BMI Predictions [click for more details]</summary><p>

  - Real-world data. Medium data set (approx. 4,200 observations)
  - Linear regression model creation, Box-Cox transformation, mean squared error (MSE) cross validation, logistic regression model creation, receiver operating characteristic (ROC) and area under the curve (AUC) model comparison
</p></details>
<br>

<font class = "td1">Dominating the Lies in the IIP - DOMI Sub-Scale</font> | <a href="https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_IPP%20DOMI.pdf" target ="blank">[Presentation]</a> & <a href="https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_IPP%20DOMI%20SAS.pdf" target ="blank">[SAS Code]</a> & <a href = "https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_Instructions.pdf" target ="blank">[Requirements]</a>
<details><summary>NIDA Cocaine Data Analysis for Improved Weighting and Predictions of IIP-DOMI Sub-scale [click for more details]</summary><p>

  - Real-world data. Medium data set (approx. 2,700 observations)
  - Analysis of the potential relationship between DOMI score and treatment type, with best-fit covariates, to determine whether treatment type impacts patients’ self-reported DOMI scores
  - Analysis of the potential relationship between DOMI score and the interaction of treatment type and type of cocaine used (cocaine or crack) to determine whether the combination of treatment type and type of cocaine used impacts patients’ self-reported DOMI scores
  - Analysis of the potential relationship between DOMI score and a selection of aggregate “home stability” variables, composed of the interaction between marital status (married or single), employment status, and education status (HS diploma, GED, or lesser education vs. greater than HS education), to determine whether the combination of home stability variables impacts patients’ self-reported DOMI scores
</p></details>
<br>

#### Linear Regression Dashboard

<font class = "td1">Melbourne Housing Market</font> | <a href="https://nlepera.shinyapps.io/final/" target ="blank">[R Shiny FlexDashboard]</a>
<details><summary>Housing market price simple linear regression analysis [click for more details]</summary><p>

  - Real-world data. Large data set (over 34,000 observations)
  - Interactive design allowing for variable selection and model manipulation
  - Interactive, input-responsive plots built with plotly
  - Responsive regression model diagnostics
  - Developed using Shiny/flexdashboard
    - Note: the regression model was not built for accuracy; it was developed to showcase dashboard-building ability
</p></details>
<br>


## Presentations

<font class = "td1">Computer Randomization Algorithms: Drug Development Pipeline</font> | <a href="https://nlepera.github.io/sta12/LePera_N_Project 1_2026-04-06.pdf" target ="blank">[Presentation]</a> 
<details><summary>A background review of various computer driven randomization algorithms in the drug development pipeline [click for more details]</summary><p>

  - Mass Spectrometry plate randomization through block randomization software
  - Integrated Web Randomization Service (IWRS) for patient randomization in clinical trials
</p></details>
<br>


<font class = "td1">Pizzaz.com Speed Dating Compatibility Prediction</font> | <a href="https://nlepera.github.io/sta511/LePera_N_Exam%202_2025.04.22.pdf" target ="blank">[Proposal]</a> & <a href="https://nlepera.github.io/sta511/exam2_pdf.pdf" target ="blank">[Presentation]</a> & <a href = "https://nlepera.github.io/sta511/exam2_SAS.pdf" target="blank">[SAS Code]</a>
<details><summary>Speed Dating Analytics & Online Dating Optimization [click for more details]</summary><p>

  - Synthetic data. Small data set (276 observations)
  - An analysis of variables impacting overall dater interest and linear regression prediction model creation
  - Proposal for data utilization and business improvement
</p></details>
<br>


## Data Visualization and Exploratory Data Analysis

#### Exploratory Data Analysis & Variable Relationship Exploration

<font class = "td1">I'm Going to Miss my Flight!</font> | <a href="https://nlepera.github.io/sta552/HW01/index.html" target ="blank">[Report]</a>
<details><summary>An analysis of flight delay times across airports [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Analysis of variables impacting flight delays & solution implementation
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>

<font class = "td1">Presidential Fitness Test</font> | <a href="https://rpubs.com/nlepera/sta551_hw01" target ="blank">[Report]</a>
<details><summary>The Impact of Local Education, Poverty, and Unemployment Rates on County Presidential Election Results [click for more details]</summary><p>

  - Four real-world data sets utilized. Large data sets (over 72,000 observations)
  - Relational data sets and data aggregation
  - GeoJSON & leaflet interactive map, ggplot2
</p></details>
<br>

#### Data Visualization

<font class = "td1">The Price of Longevity</font> | <a href="https://nlepera.github.io/sta553/w06_plotly/" target ="blank">[Report]</a>
<details><summary>An Analysis of the Effect of Annual Income on Life Expectancy [click for more details]</summary><p>

  - Real-world data. Large data set (over 35,000 observations)
  - An analysis of variables impacting average life expectancy
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>
  
<font class = "td1">Presidential Election Results Per County Per Year</font> | <a href="https://public.tableau.com/views/PresidentialElectionResultsPerCountyPerYear/Sheet1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Map]</a>
<details><summary>Geospatial data interactive mapping in Tableau [click for more details]</summary><p>

  - Real-world data. Large data set (over 72,000 observations)
  - An analysis of presidential election data from 2000 through 2020
</p></details>
<br>

<font class = "td1">National Health and Nutrition Examination Survey (NHANES)</font> | <a href="https://public.tableau.com/views/NHANESHealthData-Storypoint/SmokingHealth?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Storypoint]</a> & <a href="https://public.tableau.com/views/NHANESHealthData/SmokingDashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Dashboard]</a>
<details><summary>Public health data analysis in Tableau Storypoint & Dashboard [click for more details]</summary><p>

  - Real-world data. Medium data set (approx. 7,900 observations)
  - An analysis of NHANES data on smoking, blood pressure, and serum cholesterol levels
</p></details>

