Introduction

This report presents the initial exploratory data analysis (EDA) of the dataset. The goal of the project is to build a prediction algorithm, and a Shiny app around it, that predicts the binary Outcome variable from Age, Height, and Weight. The report covers a summary of the dataset, the findings so far, and plans for further development.

Data Loading

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Load the dataset (replace with your actual data file)
data <- read.csv("example_data.csv")

# Display the structure of the dataset
str(data)

## 'data.frame':    10 obs. of  5 variables:
##  $ ID     : int  1 2 3 4 5 6 7 8 9 10
##  $ Age    : int  25 30 22 28 35 40 45 50 32 27
##  $ Height : int  175 180 160 170 165 175 160 185 170 165
##  $ Weight : int  70 85 55 68 75 80 60 90 72 65
##  $ Outcome: int  1 0 1 0 1 0 1 0 1 0

Summary Statistics

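The table below summarizes each variable. ID is a row identifier, so its statistics carry no meaning. Outcome is coded 0/1, so its mean (0.5) is the share of observations with Outcome = 1.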

summary(data)

##        ID             Age            Height          Weight         Outcome   
##  Min.   : 1.00   Min.   :22.00   Min.   :160.0   Min.   :55.00   Min.   :0.0  
##  1st Qu.: 3.25   1st Qu.:27.25   1st Qu.:165.0   1st Qu.:65.75   1st Qu.:0.0  
##  Median : 5.50   Median :31.00   Median :170.0   Median :71.00   Median :0.5  
##  Mean   : 5.50   Mean   :33.40   Mean   :170.5   Mean   :72.00   Mean   :0.5  
##  3rd Qu.: 7.75   3rd Qu.:38.75   3rd Qu.:175.0   3rd Qu.:78.75   3rd Qu.:1.0  
##  Max.   :10.00   Max.   :50.00   Max.   :185.0   Max.   :90.00   Max.   :1.0

Exploratory Data Analysis

The following plots show the age distribution, the relationship between height and weight, and the balance of the two outcome classes:

# Distribution of Age
ggplot(data, aes(x = Age)) + 
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Distribution of Age", x = "Age (years)", y = "Count")

# Scatter Plot of Height vs Weight
ggplot(data, aes(x = Height, y = Weight)) + 
  geom_point(color = "red") +
  labs(title = "Scatter Plot of Height vs Weight", x = "Height (cm)", y = "Weight (kg)")

# Bar Plot of Outcome
ggplot(data, aes(x = factor(Outcome))) + 
  geom_bar(fill = "green", color = "black") +
  labs(title = "Distribution of Outcomes", x = "Outcome", y = "Count")

Interesting Findings

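Age Distribution: Ages range from 22 to 50 and are spread fairly evenly, with a mild concentration in the late 20s and early 30s (median 31, mean 33.4).

Height vs Weight: Height and weight are strongly positively correlated, as expected (Pearson r of about 0.91 on the 10 observations shown above).

Outcome Distribution: The outcomes are exactly balanced, with five observations each for 0 and 1, so class imbalance is not a concern for binary classification. With only 10 observations, however, any pattern seen here is tentative.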
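
The findings above can be checked numerically. The sketch below rebuilds the data frame from the values printed in the str() output earlier in this report, so it runs without example_data.csv. The variable names height_weight_cor, outcome_counts, and age_bins, and the 5-year bin edges, are illustrative choices and not part of the original analysis.

```r
# Data reconstructed from the str() output shown earlier in this report
data <- data.frame(
  ID      = 1:10,
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Pearson correlation between height and weight
height_weight_cor <- cor(data$Height, data$Weight)

# Number of observations in each outcome class
outcome_counts <- table(data$Outcome)

# Age counts in 5-year bins (bin edges are an illustrative assumption)
age_bins <- table(cut(data$Age, breaks = seq(20, 50, by = 5),
                      include.lowest = TRUE))

print(height_weight_cor)
print(outcome_counts)
print(age_bins)
```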

Plans for Prediction Algorithm and Shiny App

Prediction Algorithm

Approach: A logistic regression model will be used to predict the binary outcome from features such as Age, Height, and Weight.

Feature Selection: Initial features will include Age, Height, and Weight, with further feature engineering if necessary.

Model Evaluation: The model will be evaluated using accuracy, precision, recall, and the area under the ROC curve (AUC).
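
A minimal sketch of this plan in base R follows, assuming the data frame printed earlier in this report. The 0.5 classification threshold and the rank-based (Mann-Whitney) AUC formula are assumptions made for illustration; a package such as pROC could be used for the AUC in place of the manual calculation. All metrics here are computed on the training data, so they are optimistic, and with only 10 rows glm() may warn about fitted probabilities near 0 or 1.

```r
# Data reconstructed from the str() output shown earlier in this report
data <- data.frame(
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Fit a logistic regression of Outcome on the three planned features
model <- glm(Outcome ~ Age + Height + Weight, data = data, family = binomial)

# In-sample predicted probabilities and class labels (threshold 0.5 is assumed)
prob <- predict(model, type = "response")
pred <- as.integer(prob >= 0.5)

# Confusion-matrix counts
tp <- sum(pred == 1 & data$Outcome == 1)
fp <- sum(pred == 1 & data$Outcome == 0)
fn <- sum(pred == 0 & data$Outcome == 1)

accuracy  <- mean(pred == data$Outcome)
precision <- tp / (tp + fp)  # NaN if no observation is predicted positive
recall    <- tp / (tp + fn)

# AUC via the rank-based (Mann-Whitney) formulation
n_pos <- sum(data$Outcome == 1)
n_neg <- sum(data$Outcome == 0)
auc <- (sum(rank(prob)[data$Outcome == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

print(c(accuracy = accuracy, precision = precision, recall = recall, auc = auc))
```

On a larger dataset, the same metrics would be computed on a held-out test set or through cross-validation instead of on the training rows.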

Shiny App

App Functionality: The Shiny app will let users enter values for Age, Height, and Weight and receive a prediction for the outcome.

User Interface: The UI will be simple, with one slider per input and a text display for the predicted outcome.

Deployment: The app will be deployed on shinyapps.io, the hosted Shiny service from Posit (formerly RStudio), for easy access.
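
The app described above could look roughly like the sketch below. It assumes the shiny package is installed and refits the logistic regression on the data printed earlier in this report so that the example is self-contained. The input IDs (age, height, weight), the output ID (prediction), the slider ranges, and the 0.5 threshold are hypothetical choices, not decisions already made for the project.

```r
library(shiny)

# Data reconstructed from the str() output shown earlier in this report
data <- data.frame(
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Model used by the app (fitted once, when the app starts)
model <- glm(Outcome ~ Age + Height + Weight, data = data, family = binomial)

# UI: one slider per feature; the ranges are illustrative assumptions
ui <- fluidPage(
  titlePanel("Outcome Prediction"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("age", "Age (years)", min = 18, max = 80, value = 30),
      sliderInput("height", "Height (cm)", min = 140, max = 210, value = 170),
      sliderInput("weight", "Weight (kg)", min = 40, max = 130, value = 70)
    ),
    mainPanel(textOutput("prediction"))
  )
)

# Server: recompute the prediction whenever a slider changes
server <- function(input, output) {
  output$prediction <- renderText({
    new_obs <- data.frame(Age = input$age,
                          Height = input$height,
                          Weight = input$weight)
    p <- predict(model, newdata = new_obs, type = "response")
    sprintf("Predicted probability of Outcome = 1: %.2f (predicted class: %d)",
            p, as.integer(p >= 0.5))
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

Fitting the model at startup keeps the sketch short. For a deployed app, a model saved with saveRDS() and loaded with readRDS() would avoid refitting on every launch.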

Conclusion

This report summarizes the initial findings from the exploratory data analysis and outlines the next steps in developing a prediction algorithm and Shiny app. Feedback on the approach and findings is welcome to guide the development process.

Exploratory Data Analysis and Project Plan

Elhasnaoui

2024-08-24