---
title: "Projects and Presentations :<img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Links to relevant coursework projects and presentations"
author:
- name: Natalie LePera
  affiliation: West Chester University | M.S. Applied Statistics, Data Science Concentration
date: 
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: yes
      smooth_scroll: true
    number_sections: no
    code_folding: hide
    code_download: yes
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h3.subtitle {  /* Subtitle - font specifications of the report subtitle */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
    font-family: verdana;
}

body {
  background-color:white;
  font-family: verdana;
    }

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
    font-family: verdana;
}

h5 {
  color: navy;
  font-family: verdana;
}

.iframe {
  text-align: center;
}

a:link {
  color: black;
  font-family: verdana;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18px;
  font-family: verdana;
}

.td1 {
  font-weight: bold;
  font-family: verdana;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```



## R Markdown Projects

### Modeling, Cross Validation, & Machine Learning

#### Linear & Logistic Regression Modeling Using Machine Learning Approaches

<font class = "td1">A Blind Taste Test</font> | <a href="https://rpubs.com/nlepera/sta552_HW04" target = "blank">[Report]</a>
<details><summary>Predicting sommelier quality rankings of wine and wine type (red/white) utilizing chemical composition alone [click for more details]</summary><p>

  - Real world data; medium data set (approx. 6,500 observations)
  - Regularized linear & logistic regression prediction models, support vector machine (SVM) regression and classification models
  - Coefficient path analysis, hyperparameter tuning, and optimal cutoff probability analysis for optimized accuracy and specificity
  - Analysis goals:
      - Identifying the best possible prediction model for sommelier wine quality ranking based on chemical composition
      - Creating a white vs. red prediction model for blended wines that do not fall neatly into either category
</p></details>
<br>
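
The optimal cutoff probability analysis listed above can be illustrated with a minimal sketch. It assumes simulated stand-in data and a plain (unregularized) logistic regression from base R, not the wine data or the regularized models used in the report; every variable name here is hypothetical.

```r
# Simulated stand-in data (hypothetical; not the wine data set from the report)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))

# Plain logistic regression; the report's regularized models are not reproduced here
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Accuracy and specificity over a grid of candidate cutoff probabilities
cutoffs <- seq(0.05, 0.95, by = 0.05)
metrics <- t(sapply(cutoffs, function(cut) {
  pred <- as.integer(p_hat >= cut)
  c(cutoff = cut,
    accuracy = mean(pred == y),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}))

# Cutoff with the highest in-sample accuracy
best <- metrics[which.max(metrics[, "accuracy"]), ]
print(best)
```

Raising the cutoff can only move predictions from 1 to 0, so specificity never decreases along the grid, while accuracy typically peaks somewhere in the middle. In practice the grid would be evaluated on held-out data rather than on the training fit.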

  <font class = "td1"> Hold That Plane! </font> | <a href="https://rpubs.com/nlepera/sta552_HW03" target ="blank">[Report]</a> 
<details><summary>An analysis of global flight delay data and prediction model creation to reduce delays and increase live prediction capabilities [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Multiple Imputation by Chained Equations (MICE), Rubin's rules, recursive feature elimination, principal component analysis, k-means clustering
  - Forward selection model creation, cross validation, Receiver Operating Characteristic (ROC) comparison
  - Analysis goals:
      - Identifying the variables with greatest impact on overall flight delay for subsequent mitigation and preventative efforts
      - Live flight delay prediction based on available flight data prior to landing (predicting total delay time before flights depart)
</p>
</details>
<br>
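
Rubin's rules, named in the list above, pool an estimate across multiply imputed data sets. The sketch below uses made-up per-imputation estimates and standard errors for a single coefficient (hypothetical numbers, not results from the report) to show the arithmetic.

```r
# Hypothetical results for one coefficient from m = 5 imputed data sets
est <- c(1.92, 2.10, 2.05, 1.88, 2.00)  # point estimates
se  <- c(0.30, 0.28, 0.31, 0.29, 0.30)  # standard errors

m <- length(est)
q_bar <- mean(est)            # pooled point estimate
w <- mean(se^2)               # within-imputation variance
b <- var(est)                 # between-imputation variance
t_var <- w + (1 + 1 / m) * b  # total variance under Rubin's rules
pooled_se <- sqrt(t_var)

print(c(estimate = q_bar, se = pooled_se))
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, because the between-imputation term adds the uncertainty due to the missing values.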

#### Unsupervised Learning Algorithms and Predictive Modeling

<font class = "td1">Big Back Behavior</font> | <a href="https://rpubs.com/nlepera/sta552_HW05" target = "blank">[Report]</a>
<details><summary>Predicting obesity rankings based on behavioral survey results [click for more details]</summary><p>

  - Real world data; medium data set (approx. 2,200 observations)
  - CART regression and classification prediction models, bootstrap aggregation (bagging) regression and classification prediction models
  - Variable importance analysis, hyperparameter tuning, model pruning, ROC curve analysis & optimal cutoff identification
  - Analysis goals:
    - Develop overweight/obesity prediction model utilizing patient lifestyle and behavioral information, <b>excluding</b> weight
    - Develop weight-independent prediction models for overweight/obesity to allow for prediction of those at risk based on behavioral & lifestyle choices
</p></details>
<br>

<font class = "td1">Can you pay me back?</font> | <a href="https://rpubs.com/nlepera/sta551_hw04" target ="blank">[Report]</a> & <a href="https://nlepera.github.io/sta551/FINAL/" target ="blank">[Presentation]</a>
<details><summary>Loan Default Borrower Data Clustering & Analysis [click for more details]</summary><p>

  - Report & supporting presentation
  - Synthetic data. Medium data set (approx. 1,000 observations)
  - k-means clustering, agglomerative hierarchical clustering, principal component analysis, local outlier factor analysis
</p></details>
<br>
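
A minimal sketch of the clustering workflow named above, assuming the built-in `USArrests` data as a stand-in for the loan data and an arbitrary choice of three clusters. Only base R functions (`kmeans`, `hclust`, `prcomp`) are used; local outlier factor analysis is not covered here.

```r
# Built-in USArrests data as a stand-in (not the loan data from the report)
x <- scale(USArrests)  # standardize so no variable dominates the distances

# k-means with several random starts; k = 3 is an arbitrary illustrative choice
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

# Agglomerative hierarchical clustering, cut into the same number of groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Principal component analysis for a low-dimensional view of the clusters
pca <- prcomp(x)

# How far the two clusterings agree, and the variance captured by PC1 and PC2
print(table(kmeans = km$cluster, hclust = hc_groups))
print(summary(pca)$importance["Proportion of Variance", 1:2])
```

Plotting `pca$x[, 1:2]` colored by `km$cluster` is a common way to inspect how well the clusters separate.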


<font class = "td1">Blind Breast Biopsies</font> | <a href="https://rpubs.com/nlepera/sta551_hw03" target ="blank">[Report]</a>
<details><summary>Predicting Breast Tissue Biopsy Diagnosis In the Event of Incomplete Biopsy [click for more details]</summary><p>

  - Synthetic data. Small data set (approx. 600 observations)
  - Logistic predictive modeling, single-layer neural network model (perceptron) prediction, decision tree algorithms, ROC analysis, and bagging
</p></details>
<br>
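
ROC analysis, listed above, is often summarized by the area under the curve (AUC). The sketch below computes AUC from the rank-sum (Mann-Whitney) identity on simulated scores; `auc_rank` is a hypothetical helper written for this illustration, not a function from the report or from any package.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case scores higher than a randomly chosen negative case
auc_rank <- function(score, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated scores (hypothetical): positives tend to score higher
set.seed(7)
y <- rbinom(200, size = 1, prob = 0.4)
score <- plogis(rnorm(200, mean = ifelse(y == 1, 1, 0)))

print(auc_rank(score, y))
```

A perfectly separating score gives an AUC of 1 and a perfectly inverted one gives 0, which makes AUC convenient for comparing models such as the logistic, tree-based, and neural network classifiers named above.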

#### Regression Analysis & Model Creation

<font class = "td1">Can We Predict the Future in Health?</font> | <a href="https://rpubs.com/nlepera/sta551_hw02" target ="blank">[Report]</a>
<details><summary>Framingham Heart Study Model Fitting and Cross Validation for BMI Predictions [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 4,200 observations)
  - Linear regression model creation, Box-Cox transformation, mean squared error (MSE) cross validation, logistic regression model creation, receiver operating characteristic (ROC) and area under the curve (AUC) model comparison
</p></details>
<br>
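
MSE cross validation, as listed above, can be sketched with k-fold splits in base R. The example assumes the built-in `mtcars` data as a stand-in for the Framingham data, and `cv_mse` is a hypothetical helper written for this illustration.

```r
# k-fold cross-validated mean squared error for a linear model
cv_mse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  fold_mse <- sapply(sort(unique(folds)), function(f) {
    train <- data[folds != f, ]
    test <- data[folds == f, ]
    fit <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)  # average the per-fold MSEs
}

# Built-in mtcars data as a stand-in (not the Framingham data from the report)
set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Compare two candidate models on the same folds
mse_small <- cv_mse(mpg ~ wt, mtcars, folds)
mse_large <- cv_mse(mpg ~ wt + hp, mtcars, folds)
print(c(small = mse_small, large = mse_large))
```

Reusing the same fold assignment for every candidate keeps the comparison fair, since each model is scored on identical held-out observations.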


### Data Visualization and Exploratory Data Analysis

#### Exploratory Data Analysis & Variable Relationship Exploration

<font class = "td1">I'm Going to Miss my Flight!</font> | <a href="https://nlepera.github.io/sta552/HW01/index.html" target ="blank">[Report]</a>
<details><summary>An analysis of flight delay times across airports [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Analysis of variables impacting flight delays & solution implementation
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>

#### Exploratory Data Analysis

<font class = "td1">Presidential Fitness Test</font> | <a href="https://rpubs.com/nlepera/sta551_hw01" target ="blank">[Report]</a>
<details><summary>The Impact of Local Education, Poverty, and Unemployment Rates on County Presidential Election Results [click for more details]</summary><p>

  - Four real world data sets utilized. Large data sets (over 72,000 observations)
  - Relational data sets and data aggregation
  - GeoJSON & leaflet interactive map, ggplot2
</p></details>
<br>

#### Time Series Data

<font class = "td1">The Price of Longevity</font> | <a href="https://nlepera.github.io/sta553/w06_plotly/" target ="blank">[Report]</a>
<details><summary>An Analysis of the Effect of Annual Income on Life Expectancy [click for more details]</summary><p>

  - Real world data. Large data set (over 35,000 observations)
  - An analysis of variables impacting average life expectancy
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>
  
## SAS Projects

### Linear Regression Analysis 

#### Business Analytics Proposal

<font class = "td1">Pizzaz.com Speed Dating Comparability Prediction</font> | <a href="https://nlepera.github.io/sta511/LePera_N_Exam%202_2025.04.22.pdf" target ="blank">[Proposal]</a> & <a href="https://nlepera.github.io/sta511/exam2_pdf.pdf" target ="blank">[Presentation]</a> & <a href = "https://nlepera.github.io/sta511/exam2_SAS.pdf" target="blank">[SAS Code]</a>
<details><summary>Speed Dating Analytics & Online Dating Optimization [click for more details]</summary><p>

  - Synthetic data. Small data set (276 observations)
  - An analysis of variables impacting overall dater interest and linear regression prediction model creation
  - Proposal for data utilization and business improvement
</p></details>
<br>
  
  
## Shiny/FlexDashboard Projects

#### Linear Regression Dashboard

<font class = "td1">Melbourne Housing Market</font> | <a href="https://nlepera.shinyapps.io/final/" target ="blank">[R Shiny FlexDashboard]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Large data set (over 34,000 observations)
  - Housing market price regression analysis
</p></details>
<br>

## Tableau Projects

### Geospatial Data

#### Interactive Mapping

<font class = "td1">Presidential Election Results Per County Per Year</font> | <a href="https://public.tableau.com/views/PresidentialElectionResultsPerCountyPerYear/Sheet1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Map]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Large data set (over 72,000 observations)
  - An analysis of Presidential Election Data from 2000 through 2020
</p></details>
<br>

### Public Health Data

#### Tableau Storypoint & Dashboard

<font class = "td1">National Health and Nutrition Examination Survey (NHANES)</font> | <a href="https://public.tableau.com/views/NHANESHealthData-Storypoint/SmokingHealth?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Storypoint]</a> & <a href="https://public.tableau.com/views/NHANESHealthData/SmokingDashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Dashboard]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Medium data set (approx. 7,900 observations)
  - An analysis of NHANES data on smoking, blood pressure, and serum cholesterol levels
</p></details>
  
