This RMarkdown file reports the data analysis for a project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and building the prediction models. The final report was completed on Fri Feb 28 21:09:32 2025.
Data Description:
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This data set is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data describes one patient.
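To make the row-per-patient layout concrete, here is a minimal sketch of what a few records might look like. The column names `gender`, `bmi`, `smoking_status`, and `stroke` follow the code used later in this report; the `age` column name and every value are assumptions made up for illustration, not rows from the actual dataset.

```r
# Toy rows mirroring some of the columns used later in this report.
# All values are invented for illustration; they are not taken from the dataset.
patients <- data.frame(
  gender = c("Male", "Female", "Female", "Male"),
  age = c(70, 58, 44, 31),
  bmi = c(31.2, NA, 27.5, 22.8),
  smoking_status = c("formerly smoked", "never smoked", "smokes", "never smoked"),
  stroke = c(1, 1, 0, 0)
)

# One row per patient
nrow(patients)

# Share of each outcome class in the `stroke` column
prop.table(table(patients$stroke))

# Missing values per column (bmi has one gap in this toy example)
colSums(is.na(patients))
```

Checking the class shares and missing values first helps when reading the accuracy figures reported later, since accuracy alone can look high when one outcome class dominates.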
```{r}
# Install necessary packages (only needs to run once)
install.packages("tidyverse")
install.packages("caret")
install.packages("randomForest")
install.packages("pROC")

# Load the libraries
library(tidyverse)
library(caret)
library(randomForest)
library(pROC)

# Load the dataset
stroke_data <- read.csv("stroke_data.csv")

# Explore the data: first rows and summary statistics
head(stroke_data)
summary(stroke_data)

# Count missing values per column
colSums(is.na(stroke_data))

# Impute missing BMI values with the column mean
stroke_data$bmi[is.na(stroke_data$bmi)] <- mean(stroke_data$bmi, na.rm = TRUE)

# Convert categorical variables (and the outcome) to factors
stroke_data$gender <- as.factor(stroke_data$gender)
stroke_data$ever_married <- as.factor(stroke_data$ever_married)
stroke_data$work_type <- as.factor(stroke_data$work_type)
stroke_data$Residence_type <- as.factor(stroke_data$Residence_type)
stroke_data$smoking_status <- as.factor(stroke_data$smoking_status)
stroke_data$stroke <- as.factor(stroke_data$stroke)

# Check the resulting structure
str(stroke_data)

# Split into training (80%) and test (20%) sets, stratified on the outcome
set.seed(123)
train_index <- createDataPartition(stroke_data$stroke, p = 0.8, list = FALSE)
train_data <- stroke_data[train_index, ]
test_data <- stroke_data[-train_index, ]

# Train the Random Forest model
rf_model <- randomForest(stroke ~ ., data = train_data, ntree = 100, importance = TRUE)
print(rf_model)

# Predict classes on the test set
predictions <- predict(rf_model, test_data, type = "class")

# Evaluate with a confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$stroke)
conf_matrix

# ROC curve and AUC, computed from predicted probabilities rather than hard
# class labels. Column 2 is the second factor level of `stroke`, which is the
# level roc() treats as the case.
stroke_prob <- predict(rf_model, test_data, type = "prob")[, 2]
roc_curve <- roc(test_data$stroke, stroke_prob)
plot(roc_curve, main = "ROC Curve")
auc(roc_curve)

# Save the model for deployment, then reload it
saveRDS(rf_model, "stroke_prediction_model.rds")
loaded_model <- readRDS("stroke_prediction_model.rds")

# Predict for a single example (first test row without the outcome column)
example_data <- test_data[1, -which(names(test_data) == "stroke")]
predicted_stroke <- predict(loaded_model, example_data)
print(predicted_stroke)

# Report the findings
cat("The Random Forest model achieved an accuracy of",
    conf_matrix$overall["Accuracy"], "on the test dataset.\n")
cat("The AUC for the model is", as.numeric(auc(roc_curve)),
    "(values closer to 1 indicate better discrimination).\n")
cat("The stroke prediction model estimates the likelihood of stroke based on patient attributes. The model can be deployed in healthcare systems to assist in early stroke detection and prevention.\n")
```
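When the saved model is used on new records, factor columns generally need the same levels as the training data; otherwise `predict()` may fail or mislabel categories. The sketch below shows one possible way to align a new record with the training data using base R only. The helper name `align_to_training` and the toy values are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: give factor columns in `new_data` the same levels
# as the matching columns in `reference` (the training data).
align_to_training <- function(new_data, reference) {
  for (col in names(new_data)) {
    if (is.factor(reference[[col]])) {
      # Reuse the training levels; values unseen in training become NA
      new_data[[col]] <- factor(new_data[[col]], levels = levels(reference[[col]]))
    }
  }
  new_data
}

# Toy training data; values are invented purely for illustration
train_toy <- data.frame(
  gender = factor(c("Female", "Male", "Female")),
  smoking_status = factor(c("never smoked", "smokes", "formerly smoked")),
  age = c(61, 45, 79)
)

# A new record arriving as plain strings, for example from an input form
new_patient <- data.frame(
  gender = "Male",
  smoking_status = "smokes",
  age = 52,
  stringsAsFactors = FALSE
)

aligned <- align_to_training(new_patient, train_toy)
str(aligned)
```

In the workflow above, `reference` would be `train_data`, and the aligned record would be passed to `predict(loaded_model, ...)`. Any mean imputation applied before training would also need to be repeated with the training mean, not a mean recomputed from new data.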
This RMarkdown file provides a complete workflow for building,
evaluating, and deploying a stroke prediction model in R. You can
replace "stroke_data.csv" with the actual path to your
dataset.
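
One way to make the dataset path easy to swap is to wrap the loading step in a small function that fails early when the file is missing. This is a minimal sketch; the function name `load_stroke_data` is hypothetical, and the temporary file only stands in for a real dataset.

```r
# Hypothetical helper: read the stroke CSV from a configurable path
load_stroke_data <- function(path = "stroke_data.csv") {
  if (!file.exists(path)) {
    stop("Dataset not found at: ", path)
  }
  read.csv(path)
}

# Demonstration with a temporary file standing in for the real dataset
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(67, 49), stroke = c(1, 0)), demo_path, row.names = FALSE)

demo_data <- load_stroke_data(demo_path)
print(dim(demo_data))
```

With this in place, `stroke_data <- load_stroke_data("path/to/your/file.csv")` could replace the direct `read.csv()` call.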