About Data Analysis Report

This RMarkdown file reports the data analysis for a project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and building the prediction models. The final report was completed on Fri Feb 28 21:09:32 2025.

Data Description:

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides the relevant information about one patient.
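Because the outcome is a yes/no label per patient, it can help to check how often each label occurs before modelling. The sketch below uses a small made-up data frame to show the class proportions and the accuracy a model would reach by always predicting the majority class. The values are invented for illustration; the column name `age` is an assumption, while `bmi` and `stroke` follow the names used in the report's code.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  age    = c(67, 45, 80, 33, 59, 71, 28, 52, 64, 49),
  bmi    = c(36.6, 24.1, 32.5, 22.0, 27.3, 29.9, 21.4, 30.2, 26.8, 25.0),
  stroke = factor(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
)

# Share of rows in each outcome class
class_share <- prop.table(table(toy$stroke))
print(class_share)

# Accuracy of a trivial model that always predicts the most common class
baseline_accuracy <- max(class_share)
print(baseline_accuracy)
```

If the real data turns out to be similarly unbalanced, accuracy on its own says little, and the confusion matrix and AUC are the more informative measures.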

Load data and install packages

```{r}
# Install necessary packages
install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("pROC")

# Load libraries
library(tidyverse)
library(caret)
library(randomForest)
library(pROC)

# Load the dataset
stroke_data <- read.csv("stroke_data.csv")

# Display the first few rows of the dataset
head(stroke_data)

# Summary statistics
summary(stroke_data)

# Check for missing values
colSums(is.na(stroke_data))

# Data preprocessing: Handle missing values, encode categorical variables, etc.

# Example: Replace missing values in the 'bmi' column with the mean
stroke_data$bmi[is.na(stroke_data$bmi)] <- mean(stroke_data$bmi, na.rm = TRUE)

# Convert categorical variables to factors
stroke_data$gender <- as.factor(stroke_data$gender)
stroke_data$ever_married <- as.factor(stroke_data$ever_married)
stroke_data$work_type <- as.factor(stroke_data$work_type)
stroke_data$Residence_type <- as.factor(stroke_data$Residence_type)
stroke_data$smoking_status <- as.factor(stroke_data$smoking_status)
stroke_data$stroke <- as.factor(stroke_data$stroke)

# Check the structure of the dataset after preprocessing
str(stroke_data)

# Split the data into training and testing sets
set.seed(123)
train_index <- createDataPartition(stroke_data$stroke, p = 0.8, list = FALSE)
train_data <- stroke_data[train_index, ]
test_data <- stroke_data[-train_index, ]

# Build a Random Forest model
rf_model <- randomForest(stroke ~ ., data = train_data, ntree = 100, importance = TRUE)

# Display the model summary
print(rf_model)

# Make predictions on the test set
predictions <- predict(rf_model, test_data, type = "class")

# Confusion matrix
confusionMatrix(predictions, test_data$stroke)

# ROC curve and AUC
roc_curve <- roc(test_data$stroke, as.numeric(predictions))
plot(roc_curve, main = "ROC Curve")
auc(roc_curve)

# Save the trained model for deployment
saveRDS(rf_model, "stroke_prediction_model.rds")

# Load the model (for demonstration purposes)
loaded_model <- readRDS("stroke_prediction_model.rds")

# Example prediction using the loaded model
example_data <- test_data[1, -which(names(test_data) == "stroke")]
predicted_stroke <- predict(loaded_model, example_data)
print(predicted_stroke)

# Summary of findings
cat("The Random Forest model achieved an accuracy of", confusionMatrix(predictions, test_data$stroke)$overall['Accuracy'], "on the test dataset.")

cat("The AUC for the model is", auc(roc_curve), "indicating good predictive performance.")

# Conclusions
cat("The stroke prediction model successfully predicts the likelihood of stroke based on patient attributes. The model can be deployed in healthcare systems to assist in early stroke detection and prevention.")
```
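The code above builds the ROC curve from hard class labels, which yields only a single operating point. A ROC curve is usually computed from continuous scores; for a `randomForest` classifier these are available through `predict(rf_model, test_data, type = "prob")`. As a minimal, package-free sketch of what the AUC measures, the hypothetical helper `rank_auc()` below computes it from scores with the rank-sum (Mann-Whitney) formulation. The labels and scores are invented for illustration.

```r
# Hypothetical helper: AUC from continuous scores via the rank-sum formula.
# labels: 0/1 vector (1 = positive class); scores: higher = more likely positive.
rank_auc <- function(labels, scores) {
  stopifnot(length(labels) == length(scores))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks handle tied scores
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented example values
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
auc_value <- rank_auc(labels, scores)
print(auc_value)
```

With the report's objects, the scores would be the positive-class column of the probability matrix, for example `predict(rf_model, test_data, type = "prob")[, 2]`, assuming the second factor level is the stroke class. Passing those scores to `roc()` in place of `as.numeric(predictions)` would give a full curve rather than one point.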

Explanation of the Code:

  1. Task One: Import Data and Data Preprocessing:
    • Load the dataset and install necessary packages.
    • Perform data exploration, handle missing values, and encode categorical variables.
  2. Task Two: Build Prediction Models:
    • Split the data into training and testing sets.
    • Build a Random Forest model for stroke prediction.
  3. Task Three: Evaluate and Select Prediction Models:
    • Evaluate the model using a confusion matrix and ROC curve.
    • Calculate the AUC to assess model performance.
  4. Task Four: Deploy the Prediction Model:
    • Save the trained model for deployment.
    • Demonstrate loading the model and making predictions.
  5. Task Five: Findings and Conclusions:
    • Summarize the model’s performance and provide conclusions.
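One detail in the steps above deserves a note: the report fills missing `bmi` values with the mean of the full dataset before splitting, so the test rows influence the imputed value. A common alternative is to split first and compute the mean from the training rows only. The sketch below illustrates this on made-up data using base R. The fixed index split stands in for `createDataPartition()` and is an assumption for illustration, not the report's method.

```r
# Invented data with two missing bmi values
toy <- data.frame(
  bmi    = c(22, NA, 30, 28, 26, 24, NA, 32),
  stroke = factor(c(0, 0, 1, 0, 0, 1, 0, 0))
)

# Simple split for illustration: first six rows train, last two test
train_idx  <- 1:6
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Compute the imputation value from the training rows only
train_bmi_mean <- mean(train_data$bmi, na.rm = TRUE)

# Apply the same training-derived value to both sets
train_data$bmi[is.na(train_data$bmi)] <- train_bmi_mean
test_data$bmi[is.na(test_data$bmi)]   <- train_bmi_mean

print(train_bmi_mean)
print(test_data)
```

The same training-derived value would also be the one to store alongside the saved model, so that new patients with a missing `bmi` are handled consistently at prediction time.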

This RMarkdown file provides a complete workflow for building, evaluating, and deploying a stroke prediction model in R. You can replace "stroke_data.csv" with the actual path to your dataset.
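Since the data path is the main thing to adapt, a small guard around the load step can make a wrong path fail with a clear message. The sketch below is one way to do that. The helper name `load_stroke_data()` is hypothetical, and treating empty strings and "N/A" as missing is an assumption about how the file encodes missing values that may need adjusting.

```r
# Hypothetical helper: read the CSV after checking that the file exists.
# na_strings is an assumption about how missing values appear in the file.
load_stroke_data <- function(path, na_strings = c("", "NA", "N/A")) {
  if (!file.exists(path)) {
    stop("Data file not found: ", path, call. = FALSE)
  }
  read.csv(path, na.strings = na_strings, stringsAsFactors = FALSE)
}

# Demonstration with a temporary file holding invented rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,bmi,stroke", "Male,36.6,1", "Female,N/A,0"), tmp)
demo_data <- load_stroke_data(tmp)
print(demo_data)
print(colSums(is.na(demo_data)))
```

In the report, `stroke_data <- load_stroke_data("stroke_data.csv")` could then take the place of the plain `read.csv()` call.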