---
title: "Projects and Presentations :<img src=\"https://nlepera.github.io/sta551/HW01/img/penguin_cute.png\" style=\"float: right; width: 12%\"/>"
subtitle: "Links to relevant coursework projects and presentations"
author:
- name: Natalie LePera
  affiliation: West Chester University | M.S. Applied Statistics, Data Science Concentration
date: 
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: yes
      smooth_scroll: true
    number_sections: no
    code_folding: hide
    code_download: yes
    theme: readable
    fig_align: center
    df_print: kable
---

```{css, echo = FALSE}
h1.title {  /* Title - font specifications of the report title */
  font-weight:bold;
  color: darkmagenta ;
}
h3.subtitle {  /* Subtitle - font specifications of the report subtitle */
  font-weight:bold;
  color: darkmagenta ;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-family: system-ui;
  color: navy;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-family: system-ui;
  color: navy;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-weight:bold;
    color: navy;
    text-align: left;
    font-family: verdana;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    color: darkred;
    text-align: left;
    font-family: verdana;
}

body {
  background-color:white;
  font-family: verdana;
    }

.highlightme { 
  background-color:yellow; 
}

p { 
  background-color:white; 
    font-family: verdana;
}

h5 {
  color: navy;
  font-family: verdana;
}

.iframe {
  text-align: center;
}

a:link {
  color: black;
  font-family: verdana;
}

.figlabel {
  text-align: center;
  color: darkslategray;
  font-weight: bold;
  font-size: 18px;
  font-family: verdana;
}

.td1 {
  font-weight: bold;
  font-family: verdana;
}

th, td {
  border-bottom: 1px solid #ddd;
  text-align: left;
}

tr:hover {background-color: coral;}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```



## R Markdown Projects

### Modeling, Cross Validation, & Machine Learning

#### Linear & Logistic Regression Modeling Using Machine Learning Approaches

<font class = "td1">A Blind Taste Test</font> | <a href="https://rpubs.com/nlepera/sta552_HW04" target = "blank">[Report]</a>
<details><summary>Predicting sommelier quality rankings of wine and wine type (red/white) utilizing chemical composition alone [click for more details]</summary><p>

  - Real world data; medium data set (approx. 6,500 observations)
  - Regularized linear & logistic regression prediction models, support vector machine (SVM) regression and classification models
  - Coefficient path analysis, hyperparameter tuning, and optimal cutoff probability analysis for optimized accuracy and specificity
  - Analysis goals:
      - Identifying the best possible prediction model for sommelier wine quality ranking based on chemical composition
      - Creating a white vs. red prediction model for blended wines that do not neatly fall into either category
</p></details>
<br>
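
The optimal cutoff probability analysis listed above can be illustrated with a minimal sketch. It assumes simulated data and a plain logistic regression; the variable names and the accuracy criterion are illustrative and are not taken from the linked report.

```r
# Hypothetical illustration on simulated data (not the wine data set).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))

# Fit a logistic regression and extract the fitted probabilities.
fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# Evaluate accuracy, sensitivity, and specificity over a grid of cutoffs.
cutoffs <- seq(0.05, 0.95, by = 0.05)
metrics <- t(sapply(cutoffs, function(cut) {
  pred <- as.integer(p_hat >= cut)
  c(cutoff      = cut,
    accuracy    = mean(pred == y),
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}))

# Pick the cutoff with the highest accuracy (first one in case of ties).
best <- metrics[which.max(metrics[, "accuracy"]), ]
print(round(best, 3))
```

Raising the cutoff trades sensitivity for specificity, so a different criterion (for example, a minimum specificity) would select a different row of `metrics`.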

<font class = "td1">Hold That Plane!</font> | <a href="https://rpubs.com/nlepera/sta552_HW03" target ="blank">[Report]</a>
<details><summary>An analysis of global flight delay data and prediction model creation to reduce delays and increase live prediction capabilities [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Multiple Imputation by Chained Equations, Rubin's rules, Recursive feature elimination, Principal Component Analysis, k-Means Clustering
  - Forward elimination model creation, cross validation, Receiver Operating Characteristic (ROC) comparison
  - Analysis goals:
      - Identifying the variables with greatest impact on overall flight delay for subsequent mitigation and preventative efforts
      - Live flight delay prediction based on available flight data prior to landing (predicting total delay time before flights depart)
</p>
</details>
<br>
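
As a rough sketch of Rubin's rules mentioned above: after fitting the same model on each imputed data set, the per-imputation estimates are averaged, and the total variance combines the within-imputation and between-imputation variance. The helper name `pool_rubin` and the numbers below are hypothetical and are not taken from the linked report.

```r
# Pool one coefficient across m imputed data sets using Rubin's rules.
# `estimates` and `variances` hold the per-imputation point estimates
# and squared standard errors (hypothetical helper, base R only).
pool_rubin <- function(estimates, variances) {
  stopifnot(length(estimates) == length(variances), length(estimates) > 1)
  m       <- length(estimates)
  q_bar   <- mean(estimates)                    # pooled point estimate
  within  <- mean(variances)                    # within-imputation variance
  between <- var(estimates)                     # between-imputation variance
  total   <- within + (1 + 1 / m) * between     # total variance
  c(estimate = q_bar, within = within, between = between,
    total_var = total, se = sqrt(total))
}

# Made-up estimates from five imputations, for illustration only.
pooled <- pool_rubin(
  estimates = c(1.9, 2.1, 2.0, 2.2, 1.8),
  variances = c(0.04, 0.05, 0.04, 0.06, 0.05)
)
print(round(pooled, 4))
```

In practice a package such as `mice` performs this pooling; the sketch only spells out the arithmetic.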

#### Unsupervised Learning Algorithms and Predictive Modeling

<font class = "td1">Big Back Behavior</font> | <a href="https://rpubs.com/nlepera/sta552_HW05" target = "blank">[Report]</a>
<details><summary>Predicting obesity rankings based on behavioral survey results [click for more details]</summary><p>

  - Real world data; medium data set (approx. 2,200 observations)
  - CART regression and classification prediction models, bootstrap aggregation (bagging) regression and classification prediction models
  - Variable importance analysis, hyperparameter tuning, model pruning, ROC curve analysis & optimal cutoff identification
  - Analysis goals:
      - Develop overweight/obesity prediction model utilizing patient lifestyle and behavioral information, <b>excluding</b> weight
      - Develop weight-independent prediction models for overweight/obesity to allow for prediction of those at risk based on behavioral & lifestyle choices
</p></details>
<br>

<font class = "td1">Can you pay me back?</font> | <a href="https://rpubs.com/nlepera/sta551_hw04" target ="blank">[Report]</a> & <a href="https://nlepera.github.io/sta551/FINAL/" target ="blank">[Presentation]</a>
<details><summary>Loan Default Borrower Data Clustering & Analysis [click for more details]</summary><p>

  - Report & supporting presentation
  - Synthetic data. Medium data set (approx. 1,000 observations)
  - k-means clustering, agglomerative hierarchical data clustering, principal component analysis, local outlier factor analysis
</p></details>
<br>


<font class = "td1">Blind Breast Biopsies</font> | <a href="https://rpubs.com/nlepera/sta551_hw03" target ="blank">[Report]</a>
<details><summary>Predicting Breast Tissue Biopsy Diagnosis in the Event of an Incomplete Biopsy [click for more details]</summary><p>

  - Synthetic data. Small data set (approx. 600 observations)
  - Logistic predictive modeling, single-layer neural network model (perceptron) prediction, decision tree algorithms, ROC analysis, and bagging
</p></details>
<br>

#### Regression Analysis & Model Creation

<font class = "td1">Can We Predict the Future in Health?</font> | <a href="https://rpubs.com/nlepera/sta551_hw02" target ="blank">[Report]</a>
<details><summary>Framingham Heart Study Model Fitting and Cross Validation for BMI Predictions [click for more details]</summary><p>

  - Real world data. Medium data set (approx. 4,200 observations)
  - Linear regression model creation, Box-Cox transformation, mean squared error (MSE) cross validation, logistic regression model creation, receiver operating characteristic (ROC) and area under the curve (AUC) model comparison
</p></details>
<br>


### Data Visualization and Exploratory Data Analysis

#### Exploratory Data Analysis & Variable Relationship Exploration

<font class = "td1">I'm Going to Miss my Flight!</font> | <a href="https://nlepera.github.io/sta552/HW01/index.html" target ="blank">[Report]</a>
<details><summary>An analysis of flight delay times across airports [click for more details]</summary><p>

  - Synthetic data with forced missing values. Medium data set (over 3,500 observations)
  - Analysis of variables impacting flight delays & solution implementation
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>

#### Exploratory Data Analysis

<font class = "td1">Presidential Fitness Test</font> | <a href="https://rpubs.com/nlepera/sta551_hw01" target ="blank">[Report]</a>
<details><summary>The Impact of Local Education, Poverty, and Unemployment Rates on County Presidential Election Results [click for more details]</summary><p>

  - Four real world data sets utilized. Large data sets (over 72,000 observations)
  - Relational data sets and data aggregation
  - GeoJSON & leaflet interactive map, ggplot2
</p></details>
<br>

#### Time Series Data

<font class = "td1">The Price of Longevity</font> | <a href="https://nlepera.github.io/sta553/w06_plotly/" target ="blank">[Report]</a>
<details><summary>An Analysis of the Effect of Annual Income on Life Expectancy [click for more details]</summary><p>

  - Real world data. Large data set (over 35,000 observations)
  - An analysis of variables impacting average life expectancy
  - ggplot2 & plotly interactive visualizations
</p></details>
<br>
  
## SAS Projects

### Linear Regression Analysis 

#### Business Analytics Proposal

<font class = "td1">Pizzaz.com Speed Dating Comparability Prediction</font> | <a href="https://nlepera.github.io/sta511/LePera_N_Exam%202_2025.04.22.pdf" target ="blank">[Proposal]</a> & <a href="https://nlepera.github.io/sta511/exam2_pdf.pdf" target ="blank">[Presentation]</a> & <a href = "https://nlepera.github.io/sta511/exam2_SAS.pdf" target="blank">[SAS Code]</a>
<details><summary>Speed Dating Analytics & Online Dating Optimization [click for more details]</summary><p>

  - Synthetic data. Small data set (276 observations)
  - An analysis of variables impacting overall dater interest and linear regression prediction model creation
  - Proposal for data utilization and business improvement
</p></details>
<br>
  
  
## Shiny/FlexDashboard Projects

#### Linear Regression Dashboard

<font class = "td1">Melbourne Housing Market</font> | <a href="https://nlepera.shinyapps.io/final/" target ="blank">[R Shiny FlexDashboard]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Large data set (over 34,000 observations)
  - Housing market price regression analysis
</p></details>
<br>

## Tableau Projects

### Geospatial Data

#### Interactive Mapping

<font class = "td1">Presidential Election Results Per County Per Year</font> | <a href="https://public.tableau.com/views/PresidentialElectionResultsPerCountyPerYear/Sheet1?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Map]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Large data set (over 72,000 observations)
  - An analysis of Presidential Election Data from 2000 through 2020
</p></details>
<br>

### Public Health Data

#### Tableau Storypoint & Dashboard

<font class = "td1">National Health and Nutrition Examination Survey (NHANES)</font> | <a href="https://public.tableau.com/views/NHANESHealthData-Storypoint/SmokingHealth?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Storypoint]</a> & <a href="https://public.tableau.com/views/NHANESHealthData/SmokingDashboard?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link" target ="blank">[Dashboard]</a>
<details><summary>[click for more details]</summary><p>

  - Real world data. Medium data set (approx. 7,900 observations)
  - An analysis of NHANES data on smoking, blood pressure, and serum cholesterol levels
</p></details>
  
