This report provides an overview of the initial exploratory data analysis (EDA) conducted on the dataset. The goal of this project is to create a prediction algorithm and a Shiny app to predict the outcome based on the data. This report includes a summary of the dataset, key insights gathered so far, and plans for further development.
# Load necessary libraries
library(dplyr)
library(ggplot2)
# Load the dataset (replace with your actual data file)
data <- read.csv("example_data.csv")
# Display the structure of the dataset
str(data)
## 'data.frame': 10 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10
## $ Age : int 25 30 22 28 35 40 45 50 32 27
## $ Height : int 175 180 160 170 165 175 160 185 170 165
## $ Weight : int 70 85 55 68 75 80 60 90 72 65
## $ Outcome: int 1 0 1 0 1 0 1 0 1 0
The table below shows a summary of key statistics for the dataset.
summary(data)
##        ID             Age            Height          Weight         Outcome
##  Min.   : 1.00   Min.   :22.00   Min.   :160.0   Min.   :55.00   Min.   :0.0
##  1st Qu.: 3.25   1st Qu.:27.25   1st Qu.:165.0   1st Qu.:65.75   1st Qu.:0.0
##  Median : 5.50   Median :31.00   Median :170.0   Median :71.00   Median :0.5
##  Mean   : 5.50   Mean   :33.40   Mean   :170.5   Mean   :72.00   Mean   :0.5
##  3rd Qu.: 7.75   3rd Qu.:38.75   3rd Qu.:175.0   3rd Qu.:78.75   3rd Qu.:1.0
##  Max.   :10.00   Max.   :50.00   Max.   :185.0   Max.   :90.00   Max.   :1.0
Here are some key plots that illustrate important features of the data:
# Distribution of Age
ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Distribution of Age", x = "Age (years)", y = "Count")

# Scatter Plot of Height vs Weight
ggplot(data, aes(x = Height, y = Weight)) +
  geom_point(color = "red") +
  labs(title = "Scatter Plot of Height vs Weight", x = "Height (cm)", y = "Weight (kg)")

# Bar Plot of Outcome
ggplot(data, aes(x = factor(Outcome))) +
  geom_bar(fill = "green", color = "black") +
  labs(title = "Distribution of Outcomes", x = "Outcome", y = "Count")
Age Distribution: Ages range from 22 to 50 and are spread across that range, with no single age group dominating.
Height vs Weight: There is a positive correlation between height and weight, as expected.
Outcome Distribution: The outcomes are balanced, with equal counts (five each) for 0 and 1, so class imbalance is not a concern for binary classification.
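The observations above can be checked numerically. The sketch below assumes that the ten rows shown in the `str()` output are the full dataset and rebuilds them by hand; the names `example_df`, `hw_cor`, and `outcome_counts` are illustrative and do not come from the original analysis.

```r
# Rebuild the ten rows shown in the str() output
# (assumption: these rows are the complete dataset)
example_df <- data.frame(
  ID      = 1:10,
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Pearson correlation between height and weight
hw_cor <- cor(example_df$Height, example_df$Weight)

# Number of rows in each outcome class
outcome_counts <- table(example_df$Outcome)

print(round(hw_cor, 2))
print(outcome_counts)
```

On these ten rows the correlation comes out at roughly 0.91, and each outcome class has five rows. With so few observations, both figures are only a rough guide.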
Approach: A logistic regression model will be used to predict the binary outcome based on features like Age, Height, and Weight.
Feature Selection: Initial features will include Age, Height, and Weight, with further feature engineering if necessary.
Model Evaluation: The model will be evaluated using accuracy, precision, recall, and the area under the ROC curve (AUC).
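The modeling plan above could be sketched as follows. This is a minimal illustration, not the final model: it assumes the ten rows from the `str()` output are the whole dataset, scores the model on its own training rows rather than on held-out data, and uses a 0.5 probability cutoff. The names `model_df`, `fit`, and `auc` are hypothetical. With only ten rows, `glm()` may emit convergence warnings, and the metrics say little about real predictive performance.

```r
# Ten rows as shown in the str() output (assumed to be the full dataset)
model_df <- data.frame(
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Logistic regression on the three planned features
fit <- glm(Outcome ~ Age + Height + Weight, data = model_df, family = binomial)

# Predicted probabilities on the training rows (in-sample only)
prob <- predict(fit, type = "response")
pred <- as.integer(prob >= 0.5)  # the 0.5 cutoff is an assumption

# Confusion-matrix counts
actual <- model_df$Outcome
tp <- sum(pred == 1 & actual == 1)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)

accuracy  <- mean(pred == actual)
precision <- tp / (tp + fp)  # NaN if no row is predicted positive
recall    <- tp / (tp + fn)

# AUC via the rank-sum (Mann-Whitney) identity, so no extra package is needed
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(accuracy = accuracy, precision = precision, recall = recall, auc = auc))
```

For the real dataset, a train/test split or cross-validation would give a more honest estimate than the in-sample scores computed here.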
App Functionality: The Shiny app will allow users to input values for Age, Height, and Weight and receive a prediction for the outcome.
User Interface: The UI will be simple and intuitive, with sliders for input and a display for the predicted outcome.
Deployment: The app will be deployed on RStudio’s Shiny servers for easy access.
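A rough sketch of how the planned app might be wired together is shown below. Everything here is an assumption made for illustration: the input IDs, the slider ranges (taken from the minimum and maximum values in the summary table), the default slider values, and the model `app_fit`, which is refit on the ten rows from the `str()` output so that the example stands alone.

```r
library(shiny)

# Ten rows as shown in the str() output (assumed to be the full dataset)
app_df <- data.frame(
  Age     = c(25, 30, 22, 28, 35, 40, 45, 50, 32, 27),
  Height  = c(175, 180, 160, 170, 165, 175, 160, 185, 170, 165),
  Weight  = c(70, 85, 55, 68, 75, 80, 60, 90, 72, 65),
  Outcome = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# Illustrative model; a real app would load a model trained beforehand
app_fit <- glm(Outcome ~ Age + Height + Weight, data = app_df, family = binomial)

# UI: one slider per feature and a text display for the prediction
ui <- fluidPage(
  titlePanel("Outcome Prediction"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("age", "Age (years)", min = 22, max = 50, value = 31),
      sliderInput("height", "Height (cm)", min = 160, max = 185, value = 170),
      sliderInput("weight", "Weight (kg)", min = 55, max = 90, value = 71)
    ),
    mainPanel(textOutput("prediction"))
  )
)

# Server: recompute the predicted probability whenever a slider moves
server <- function(input, output) {
  output$prediction <- renderText({
    new_row <- data.frame(
      Age = input$age, Height = input$height, Weight = input$weight
    )
    p <- predict(app_fit, newdata = new_row, type = "response")
    sprintf("Predicted probability of Outcome = 1: %.2f", p)
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

Showing the predicted probability instead of a hard 0/1 label is one possible design choice; the final app could apply a cutoff and display the class instead.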
This report summarizes the initial findings from the exploratory data analysis and outlines the next steps in developing a prediction algorithm and Shiny app. Feedback on the approach and findings is welcome to guide the development process.