---
title: "Projects and Presentations: <img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Check out the links below to see what I have been working on!"
author:
- name: Natalie LePera
  affiliation: West Chester University | M.S. Applied Statistics, Data Science Concentration
date: "Last Update: 07 Apr 2026"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: yes
      smooth_scroll: true
    number_sections: no
    code_folding: hide
    code_download: yes
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h1.subtitle {  /* Subtitle - font specifications of the report subtitle */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
    font-family: verdana;
}

body {
  background-color:white;
  font-family: verdana;
    }

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
    font-family: verdana;
}

h5 {
  color: navy;
  font-family: verdana;
}

.iframe {
  text-align: center;
}

a:link {
  color: purple ;
  font-family: verdana;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18px;
    font-family: verdana;
}

.td1 {
  font-weight: bold;
  font-family: verdana;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<script src="https://elfsightcdn.com/platform.js" async></script>
<div class="elfsight-app-5fc8465d-8b62-465e-934e-71d6d7426965" data-elfsight-app-lazy></div>
  


## Machine Learning, Modeling, & Cross Validation 

#### Unsupervised Learning Algorithms and Predictive Modeling

<font class = "td1">Big Back Behavior</font> | <a href="https://rpubs.com/nlepera/sta552_HW05" target = "blank">[Report]</a>
<details><summary>Predicting obesity rankings based on behavioral survey results [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 2,200 observations)
  - CART regression and classification prediction models, bootstrap aggregating (bagging) regression and classification prediction models
  - Variable importance analysis, hyperparameter tuning, model pruning, ROC curve analysis & optimal cutoff identification
  - Analysis goals:
    - Develop overweight/obesity prediction model utilizing patient lifestyle and behavioral information, <b>excluding</b> weight
    - Develop weight independent prediction models for overweight/obesity to allow for prediction of those at risk based on behavioral & lifestyle choices
</p></details>
<br>
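
As a rough illustration of the optimal cutoff identification step listed above, the sketch below picks the probability cutoff that maximizes Youden's J (sensitivity + specificity - 1) in base R. The criterion, the function name `best_cutoff`, the cutoff grid, and the toy scores are all assumptions made for illustration; the linked report may use a different criterion or package.

```r
# Choose the probability cutoff that maximizes Youden's J
# (sensitivity + specificity - 1) over a grid of candidate cutoffs.
# `prob` = predicted probabilities, `truth` = 0/1 labels (both hypothetical).
best_cutoff <- function(prob, truth, grid = seq(0.05, 0.95, by = 0.05)) {
  j <- sapply(grid, function(cut) {
    pred <- prob >= cut
    sens <- sum(pred & (truth == 1)) / sum(truth == 1)
    spec <- sum((!pred) & (truth == 0)) / sum(truth == 0)
    sens + spec - 1
  })
  # which.max() returns the first maximum when several cutoffs tie
  grid[which.max(j)]
}

# Toy scores and labels, invented for illustration only
prob   <- c(0.12, 0.22, 0.37, 0.42, 0.61, 0.72, 0.83, 0.91)
truth  <- c(0,    0,    0,    1,    0,    1,    1,    1)
cutoff <- best_cutoff(prob, truth)
```

Swapping the expression for `j` with accuracy or a cost-weighted score changes the criterion without changing the search.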

<font class = "td1">Can you pay me back?</font> | <a href="https://rpubs.com/nlepera/sta551_hw04" target ="blank">[Report]</a> & <a href="https://nlepera.github.io/sta551/FINAL/" target ="blank">[Presentation]</a>
<details><summary>Loan Default Borrower Data Clustering & Analysis [click for more details]</summary><p>

  - Report & supporting presentation
  - Synthetic data. Medium data set (approx. 1,000 observations)
  - k-means clustering, agglomerative hierarchical clustering, principal component analysis, local outlier factor analysis
</p></details>
<br>


<font class = "td1">Blind Breast Biopsies</font> | <a href="https://rpubs.com/nlepera/sta551_hw03" target ="blank">[Report]</a>
<details><summary>Predicting Breast Tissue Biopsy Diagnosis In the Event of Incomplete Biopsy [click for more details]</summary><p>

  - Synthetic data. Small data set (approx. 600 observations)
  - Logistic predictive modeling, single-layer neural network model (perceptron) prediction, decision tree algorithms, ROC analysis, and bagging
</p></details>
<br>

#### Linear & Logistic Regression Modeling Using Machine Learning Approaches

<font class = "td1">A Blind Taste Test</font> | <a href="https://rpubs.com/nlepera/sta552_HW04" target = "blank">[Report]</a>
<details><summary>Predicting sommelier quality rankings of wine and wine type (red/white) utilizing chemical composition alone [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 6,500 observations)
  - Regularized linear & logistic regression prediction models, support vector machine (SVM) regression and classification models
  - Coefficient path analysis, hyperparameter tuning, and optimal cutoff probability analysis for optimized accuracy and specificity
  - Analysis goals:
      - Identifying the best possible prediction model for sommelier wine quality ranking based on chemical composition
      - Creating a white vs red prediction model for blend wines that do not neatly fall into either category
</p></details>
<br>

<font class = "td1">Hold That Plane!</font> | <a href="https://rpubs.com/nlepera/sta552_HW03" target ="blank">[Report]</a>
<details><summary>An analysis of global flight delay data and prediction model creation to reduce delays and increase live prediction capabilities [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Multiple Imputation by Chained Equations (MICE), Rubin's rules, recursive feature elimination, principal component analysis, k-means clustering
  - Forward elimination model creation, cross validation, Receiver Operating Characteristic (ROC) comparison
  - Analysis goals:
      - Identifying the variables with greatest impact on overall flight delay for subsequent mitigation and preventative efforts
      - Live flight delay prediction based on available flight data prior to landing (predicting total delay time before flights depart)
</p>
</details>
<br>
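
Rubin's rules, named in the list above, combine the estimates from several imputed data sets into a single estimate and standard error. A minimal base R sketch follows, assuming one coefficient per imputed fit; `pool_rubin` and the input numbers are hypothetical and are not results from the report.

```r
# Pool one coefficient across m imputed-data fits using Rubin's rules.
# `estimates` and `variances` are hypothetical inputs: the point estimate
# and its squared standard error from each imputed data set.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m > 1, length(variances) == m)
  q_bar <- mean(estimates)          # pooled point estimate
  w_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- w_bar + (1 + 1 / m) * b  # total variance
  list(estimate = q_bar, se = sqrt(total), within = w_bar, between = b)
}

# Made-up numbers for illustration only
pooled <- pool_rubin(estimates = c(1.0, 1.2, 0.8),
                     variances = c(0.04, 0.05, 0.03))
```

The `(1 + 1 / m)` factor inflates the between-imputation variance to account for using a finite number of imputations.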


## Regression

#### Regression Analysis & Model Creation

<font class = "td1">Can We Predict the Future in Health?</font> | <a href="https://rpubs.com/nlepera/sta551_hw02" target ="blank">[Report]</a>
<details><summary>Framingham Heart Study Model Fitting and Cross Validation for BMI Predictions [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 4,200 observations)
  - Linear regression model creation, Box-Cox transformation, mean squared error (MSE) cross validation, logistic regression model creation, receiver operating characteristic (ROC) and area under the curve (AUC) model comparison
</p></details>
<br>

<font class = "td1">Dominating the Lies in the IIP - DOMI Sub-Scale</font> | <a href="https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_IPP%20DOMI.pdf" target ="blank">[Presentation]</a> & <a href="https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_IPP%20DOMI%20SAS.pdf" target ="blank">[SAS Code]</a> & <a href = "https://nlepera.github.io/sta513/LePera_N_Final%20Project_2025-12-09_Instructions.pdf" target ="blank">[Requirements]</a>
<details><summary>NIDA Cocaine Data Analysis for Improved Weighting and Predictions of IIP-DOMI Sub-scale [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 2,700 observations)
  - Analysis of the potential relationship between DOMI score and treatment type with best fit covariates to determine if treatment type impacts patients’ self-reported DOMI scores
  - Analysis of the potential relationship between DOMI score and the interaction of treatment type and type of cocaine used (cocaine or crack) to determine if the combination of treatment type and type of cocaine used impacts patients’ self-reported DOMI scores
  - Analysis of the potential relationship between DOMI score and a selection of aggregate “home stability” variables, composed of the interaction between marital status (married or single), employment status, and education status (HS diploma, GED, or lesser education vs greater than HS education) to determine if the combination of home stability variables impacts patients’ self-reported DOMI scores
</p></details>
<br>

#### Linear Regression Dashboard

<font class = "td1">Melbourne Housing Market</font> | <a href="https://nlepera.shinyapps.io/final/" target ="blank">[R Shiny FlexDashboard]</a>
<details><summary>Housing market price simple linear regression analysis [click for more details]</summary><p>

  - Real world data. Large data set (over 34,000 observations)
  - Interactive design allowing for variable selection and model manipulation
  - Interactive and input-responsive plots utilizing plotly
  - Responsive regression model diagnostics
  - Developed using Shiny/FlexDashboard
    - Note: the regression model was not built for accuracy; it was developed to showcase dashboard-building ability
</p></details>
<br>


## Presentations

<font class = "td1">Computer Randomization Algorithms: Drug Development Pipeline</font> | <a href="https://nlepera.github.io/sta12/LePera_N_Project 1_2026-04-06.pdf" target ="blank">[Presentation]</a> 
<details><summary>A background review of various computer driven randomization algorithms in the drug development pipeline [click for more details]</summary><p>

  - Mass Spectrometry plate randomization through block randomization software
  - Integrated Web Randomization Service (IWRS) for patient randomization in clinical trials
</p></details>
<br>
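
To give a sense of what block randomization does, here is a small base R sketch of permuted-block allocation. It is an illustration only: `block_randomize`, the arm labels, the block size, and the seed are assumptions, and it does not reproduce the plate randomization software or the IWRS discussed in the presentation.

```r
# Permuted-block randomization: within every block each arm appears equally
# often, so group sizes stay balanced as enrollment proceeds.
# Arm labels and block size are hypothetical defaults.
block_randomize <- function(n_blocks, arms = c("A", "B"), per_arm = 2) {
  block <- rep(arms, each = per_arm)  # one balanced block, e.g. A A B B
  # Shuffle each block independently, then concatenate the blocks in order
  unlist(lapply(seq_len(n_blocks), function(i) sample(block)))
}

set.seed(42)  # arbitrary seed so the allocation is reproducible
allocation <- block_randomize(n_blocks = 3)
```

With three blocks of four, each arm receives 6 of the 12 assignments, and no block ever holds more than two of the same arm.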


<font class = "td1">Pizzaz.com Speed Dating Comparability Prediction</font> | <a href="https://nlepera.github.io/sta511/LePera_N_Exam%202_2025.04.22.pdf" target ="blank">[Proposal]</a> & <a href="https://nlepera.github.io/sta511/exam2_pdf.pdf" target ="blank">[Presentation]</a> & <a href = "https://nlepera.github.io/sta511/exam2_SAS.pdf" target="blank">[SAS Code]</a>
<details><summary>Speed Dating Analytics & Online Dating Optimization [click for more details]</summary><p>

  - Synthetic data. Small data set (276 observations)
  - An analysis of variables impacting overall dater interest and linear regression prediction model creation
  - Proposal for data utilization and business improvement
</p></details>
<br>


## Data Visualization and Exploratory Data Analysis

#### Exploratory Data Analysis & Variable Relationship Exploration

<font class = "td1">I'm Going to Miss my Flight!</font> | <a href="https://nlepera.github.io/sta552/HW01/index.html" target ="blank">[Report]</a>
<details><summary>An analysis of flight delay times across airports [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Analysis of variables impacting flight delays & solution implementation
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>

<font class = "td1">Presidential Fitness Test</font> | <a href="https://rpubs.com/nlepera/sta551_hw01" target ="blank">[Report]</a>
<details><summary>The Impact of Local Education, Poverty, and Unemployment Rates on County Presidential Election Results [click for more details]</summary><p>

  - Four real world data sets utilized. Large data sets (over 72,000 observations)
  - Relational data sets and data aggregation
  - GeoJSON & leaflet interactive map, ggplot2
</p></details>
<br>

#### Data Visualization

<font class = "td1">The Price of Longevity</font> | <a href="https://nlepera.github.io/sta553/w06_plotly/" target ="blank">[Report]</a>
<details><summary>An Analysis of the Effect of Annual Income on Life Expectancy [click for more details]</summary><p>

  - Real world data. Large data set (over 35,000 observations)
  - An analysis of variables impacting average life expectancy
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>
  
<font class = "td1">Presidential Election Results Per County Per Year</font> | <a href="https://public.tableau.com/views/PresidentialElectionResultsPerCountyPerYear/Sheet1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Map]</a>
<details><summary>Geo-spatial data interactive mapping in Tableau [click for more details]</summary><p>

  - Real world data. Large data set (over 72,000 observations)
  - An analysis of Presidential Election Data from 2000 through 2020
</p></details>
<br>

<font class = "td1">National Health and Nutrition Examination Survey (NHANES)</font> | <a href="https://public.tableau.com/views/NHANESHealthData-Storypoint/SmokingHealth?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Storypoint]</a> & <a href="https://public.tableau.com/views/NHANESHealthData/SmokingDashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Dashboard]</a>
<details><summary>Public health data analysis in Tableau Storypoint & Dashboard [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 7,900 observations)
  - An analysis of NHANES data on smoking, blood pressure, and serum cholesterol levels
</p></details>

