Introduction

According to the definition proposed by the World Health Organization in 1970, “stroke is rapidly developing clinical signs of focal (or global) disturbance of cerebral function, with symptoms lasting 24 hours or longer, or leading to death, with no apparent cause other than of vascular origin”. Stroke has a huge public health burden, which is set to rise over future decades because of demographic transitions of populations, particularly in developing countries [3]. This places stroke high on the agenda of public health issues in the century and is an important area for public health research. This paper provides an overview of stroke in the century from a public health perspective and covers several areas of the disease related to the burden, epidemiology, and quality of life.

Primary Analysis Objectives

Determine whether age ,gender ,smoking_status ,hypertension, Average glucose leve and BMI are contributing factors for the occurrence of stroke

Data Sources

The patient data was obtained from Kaggle.The dataset consisted of 11 metrics for a total of 43,400 patients. These metrics included patients’ demographic data (gender, age, marital status, type of work and residence type) and health records (hypertension, heart disease, average glucose level measured after meal, Body Mass Index (BMI), smoking status and experience of stroke). from this variables we will have used gender , age , smoking status, bmi,and hypertension to predict the occurrence of stroke. Results were visualized and discovered insights were discussed. It is ended with a conclusion and some ideas were suggested for future work.

library(readr)
stroke <- read_csv("C:/Users/ma074675/Downloads/stroke.csv")
## 
## -- Column specification ------------------------------------
## cols(
##   id = col_double(),
##   gender = col_character(),
##   age = col_double(),
##   hypertension = col_double(),
##   heart_disease = col_double(),
##   ever_married = col_character(),
##   work_type = col_character(),
##   Residence_type = col_character(),
##   avg_glucose_level = col_double(),
##   bmi = col_character(),
##   smoking_status = col_character(),
##   stroke = col_double()
## )
stroke <- subset(stroke, select = c(gender, age, bmi, smoking_status, 
    stroke, hypertension, avg_glucose_level))

colSums(stroke == "N/A")
##            gender               age               bmi 
##                 0                 0               201 
##    smoking_status            stroke      hypertension 
##                 0                 0                 0 
## avg_glucose_level 
##                 0
stroke <- stroke %>%
    filter(bmi != "N/A", gender != "Other") %>%
    mutate(gender = factor(gender), hypertension = factor(hypertension), 
        bmi = as.numeric(bmi), smoking_status = factor(smoking_status), 
        stroke = factor(stroke), avg_glucose_level = as.numeric(avg_glucose_level))

Statistical Analysis

The data is available in .xlsx (excel) format. The data analysis was done using the statistical software R Version 4.0.2 and the project focuses mainly on multiple logistic regressions.For the preliminary investigation All potential predictor variables have been explored individually and the illustrations have been plotted throughout the entire dataset.the dataset was cleaned and all instance of with missing values have been removed.

Model Assumptions

All inferences were conducted using \(\alpha = 0.05\) unless stated otherwise. No adjustments for multiplicity were made as this is an exploratory analysis. categorical variables are summarized with proportions and frequencies. Continuous variables are summarized using mean, median, standard deviation, coefficient of variation, quantities,variance, maximum and minimum.

Analysis of variables

The amount of those who have had a stroke is a small portion of the population. This follows for the populations of different factors. Proportions of those who had a stroke in each factor will be compared.

p1 <- stroke %>%
    ggplot(aes(x = gender, fill = stroke)) + geom_bar()

p2 <- stroke %>%
    ggplot(aes(x = hypertension, fill = stroke)) + geom_bar()

p3 <- stroke %>%
    ggplot(aes(x = age, fill = stroke)) + geom_bar()

p4 <- stroke %>%
    ggplot(aes(x = bmi, fill = stroke)) + geom_bar()
p5 <- stroke %>%
    ggplot(aes(x = smoking_status, fill = stroke)) + geom_bar()
p6 <- stroke %>%
    ggplot(aes(x = avg_glucose_level, fill = stroke)) + geom_bar()

grid.arrange(grobs = list(p1, p2, p3, p4, p5, p6), ncol = 2, 
    top = "Count of Levels for Each Factor")
## Warning: position_stack requires non-overlapping x intervals

dat_prop <- stroke %>%
    group_by(gender) %>%
    summarise(prop = sum(stroke == "1")/length(gender))

p1p <- dat_prop %>%
    ggplot(aes(x = gender, y = prop)) + geom_col(fill = "#00BFC4")

dat_prop <- stroke %>%
    group_by(hypertension) %>%
    summarise(prop = sum(stroke == "1")/length(hypertension))

p2p <- dat_prop %>%
    ggplot(aes(x = hypertension, y = prop)) + geom_col(fill = "#00BFC4")


dat_prop <- stroke %>%
    group_by(smoking_status) %>%
    summarise(prop = sum(stroke == "1")/length(smoking_status))

p3p <- dat_prop %>%
    ggplot(aes(x = smoking_status, y = prop)) + geom_col(fill = "#00BFC4")




grid.arrange(grobs = list(p1p, p2p, p3p), ncol = 1, top = "Proportion of Strokes for Each Factor")

p1 <- stroke %>%
    ggplot(aes(x = age, fill = stroke)) + geom_density(alpha = 0.5) + 
    theme(legend.position = "none")

p2 <- stroke %>%
    ggplot(aes(x = bmi, fill = stroke)) + geom_density(alpha = 0.5) + 
    theme(legend.position = "none")

p3 <- stroke %>%
    ggplot(aes(x = age, fill = stroke)) + geom_histogram() + 
    theme(legend.position = "none")


p4 <- stroke %>%
    ggplot(aes(x = bmi, fill = stroke)) + geom_histogram() + 
    theme(legend.position = "none")

p5 <- stroke %>%
    ggplot(aes(x = avg_glucose_level, fill = stroke)) + geom_histogram() + 
    theme(legend.position = "none")

grid.arrange(grobs = list(p1, p2, p3, p4, p5), nrow = 2, top = "Distribution of Continuous Variables")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Gender doesnot appear to have much difference in the occurrence of strokes,population that are hypertensive have a higher proportion of the population had stroke.Current smokers have a higher proportion of their population having had a stroke than those who have never smoked with a difference of 1.41%. Former smokers have a higher occurrence of strokes than current smokers with a difference of 1.53%. Those with unknown smoking have a low occurrence of strokes at 2.47%.

From figure 3 we observed that as age increases the amount of strokes increases. when it comes to BMI there doesnot seem to be any difference between those who have had stroke and those whom havent yet had a stroke.

Result

Stroke vs Age

stroke1 <- glm(stroke ~ age, family = binomial, data = stroke)
summary(stroke1)
## 
## Call:
## glm(formula = stroke ~ age, family = binomial, data = stroke)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.716  -0.307  -0.164  -0.078   3.558  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.37687    0.36240   -20.4   <2e-16 ***
## age          0.07496    0.00532    14.1   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1728.3  on 4907  degrees of freedom
## Residual deviance: 1407.7  on 4906  degrees of freedom
## AIC: 1412
## 
## Number of Fisher Scoring iterations: 7
coef(stroke1)
## (Intercept)         age 
##   -7.376871    0.074964

\(logit(/pi^(x))=-7.3769+0.075x\), where x=age and Beta=0.075 in this equation. The interpretation for this model in a general for every 1-unit increase in age, there is a 0.075 increase in the log odds of O stroke. When we exponential this to the odds ratio as taking e^0.075= 0.793. Thus we interpret this as for every 1 unit increase in age, we expect a 1.07788 in the odds of stroke.Risk of stroke is more likely in increasing age. With this very small p-value and 0.075 as a coefficient , we can definitely reject the null that increasing age and stroke has no association.

xtable(stroke1, label = "tab:tab1", caption = c("Stoke vs Age"))
stroke2 <- glm(stroke ~ gender, family = binomial, data = stroke)
xtable(stroke2, label = "tab:tab2", caption = c("stoke vs Gender"))

\(logit(\pi^(x))=-3.1416+0.0691x\), where x=gender and Beta=0.691 in this equation.

For gender, even our Beta =o.0691 is positive but not that strong evidence to reject the null with this p-value=0.6287. The p value is not small enough to think there is an association in between gender and stroke.

logistic.display(stroke2)
## 
## Logistic regression predicting stroke : 1 vs 0 
##  
##                        OR(95%CI)         P(Wald's test)
## gender: Male vs Female 1.07 (0.81,1.42)  0.629         
##                                                        
##                        P(LR-test)
## gender: Male vs Female 0.629     
##                                  
## Log-likelihood = -864.033
## No. of observations = 4908
## AIC value = 1732.0661
strokeMod <- glm(stroke ~ hypertension, family = binomial(link = "logit"), 
    data = stroke)
xtable(strokeMod, label = "tab:tab3", caption = c("Stroke VS Hypertension"))
strokeWhtn <- data.frame(hypertension = "1")
htnPred <- predict(strokeMod, strokeWhtn, type = "response")
pctWhtn <- htnPred * 100
strokenohtn <- data.frame(hypertension = "0")
nohtnpred <- predict(strokeMod, strokenohtn, type = "response")
pctnohtn <- nohtnpred * 100

The intercept coefficient value is -3.364 which represents the odds of people without hypertension having a stroke.Hypertensive patients, within this group, have a 0.13304 probability of stroke which indicates a 13.304 % chance of having a stroke.This signals an increase in risk for patients diagnosed with hypertension.Patients without hypertension, within this group, have a 0.03343 probability of stroke which indicates a 3.343 % chance of having a stroke.Relatively good odds of avoiding a life altering event such as a stroke.

Stoke VS BMI

stroke4 <- glm(stroke ~ bmi, family = binomial(link = "logit"), 
    data = stroke)
xtable(stroke4, label = "tab:tab4", caption = c("Stroke VS BMI"))

\(logit(\pi^(x))=-3.1416+0.0691x\) , where x=BMI and Beta=0.0241 in this equation.

We can say with beta value even our p -value is not big , there is no evidence to reject the null and say BMI is correlated with stroke at this point.

Stoke VS Smoking status

xtable(stroke5, label = "tab:tab5", caption = c("Model with Predictors smoking status, smoked, neversmoked, unknown", 
    "ANOVA: smoked, neversmoked, unknown "))

Current smokers have a higher proportion of their population having had a stroke than those who have never smoked with a difference of 1.41%. Former smokers have a higher occurrence of strokes than current smokers with a difference of 1.53%. Those with unknown smoking have a low occurrence of strokes at 2.47%.

Stoke VS Glucose Level

The distribution for average glucose level is bimodal for both stroke and no stroke populations, with peaks at the same values. However the density of strokes at higher glucose levels is higher than the density of no strokes at the same level.

stroke6 <- glm(stroke ~ avg_glucose_level, family = binomial(link = "logit"), 
    data = stroke)
xtable(stroke6, label = "tab:tab6", caption = c("Stroke VS avg_glucose_level"))

Discussion and Conclusion

# Setting seed for replication
set.seed(2000)
# Creating training and test sets
split <- sort(sample(nrow(stroke), nrow(stroke) * 0.6))
stroke_train <- stroke[split, ]
stroke_test <- stroke[-split, ]

# full model
fullmodel <- glm(stroke ~ gender + age + smoking_status + hypertension + 
    bmi + avg_glucose_level, data = stroke_train, family = binomial(link = "logit"))


# reduced 1
reducedmodel <- glm(stroke ~ age + smoking_status + hypertension + 
    bmi + avg_glucose_level, data = stroke_train, family = binomial(link = "logit"))


# reduced 2

reducedmodel2 <- glm(stroke ~ age + smoking_status + hypertension + 
    avg_glucose_level, data = stroke_train, family = binomial(link = "logit"))


# reduced 3

reducedmodel3 <- glm(stroke ~ age + hypertension + avg_glucose_level, 
    data = stroke_train, family = binomial(link = "logit"))


# reduced 4

reducedmodel4 <- glm(stroke ~ age + I(smoking_status)^2 + hypertension + 
    avg_glucose_level, data = stroke_train, family = binomial(link = "logit"))



# reduced 5

reducedmodel5 <- glm(stroke ~ age + smoking_status + I(hypertension)^2 + 
    avg_glucose_level, data = stroke_train, family = binomial(link = "logit"))


# reduced 6

glm_stroke_bic <- glm(stroke ~ (age * bmi) + hypertension + avg_glucose_level, 
    data = stroke_train, family = binomial(link = "logit"))



test <- wald.test(b = coef(glm_stroke_bic), Sigma = vcov(glm_stroke_bic), 
    Terms = 1)

$T = -10.6251361 + 0.1143727(X_1) + 0.0954(X_2) + 0.7435(X_3) + 0.0038 (X_4) - 0.0014963 (x_5) $,

where x_1=age x_2=BMI x_3=hypertension x_4=avg_glucose_level x_4=age*BMI

Reduce model 6 is giving us one the small AIC so we would like to say this the best fit overall. even is not the smallest AIC is simpler than the other models so we can analysis better.

roc_pred <- prediction(predictions = glm_stroke_bic$fitted.values, 
    labels = stroke_train$stroke)

roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")

roc_curve <- data.frame(Spec = 1 - unlist(roc_perf@x.values), 
    Sens = unlist(roc_perf@y.values), thresh = unlist(roc_perf@alpha.values))

roc_curve$distance <- sqrt((1 - roc_curve$Spec)^2 + (1 - roc_curve$Sens)^2)

opt <- roc_curve %>%
    slice(distance %>%
        which.min())

plot(roc_perf, main = "Logistic Regression for Strokes")
abline(0, 1, col = "grey80")
# Optimal Threshold
abline(v = 1 - opt$Spec, col = "gray80")
abline(h = opt$Sens, col = "gray80")

glm_predict <- predict.glm(glm_stroke_bic, newdata = stroke_test, 
    type = "response", se.fit = FALSE) %>%
    as_tibble()
glm_predict$pred <- ifelse(glm_predict$value >= opt$thresh, 1, 
    0)
glm_predict <- glm_predict %>%
    mutate(pred = factor(pred))
confusionMatrix(data = glm_predict$pred, reference = stroke_test$stroke, 
    positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1375   16
##          1  499   74
##                                         
##                Accuracy : 0.738         
##                  95% CI : (0.718, 0.757)
##     No Information Rate : 0.954         
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.156         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.8222        
##             Specificity : 0.7337        
##          Pos Pred Value : 0.1291        
##          Neg Pred Value : 0.9885        
##              Prevalence : 0.0458        
##          Detection Rate : 0.0377        
##    Detection Prevalence : 0.2918        
##       Balanced Accuracy : 0.7780        
##                                         
##        'Positive' Class : 1             
## 

The model performs well with an accuracy of 73.8%. The sensitivity of 82.2% is good since we want to miss as little strokes as possible.

Improvements can be made if more data were to be collected. They also might be made if additional factors were collected, specifically ones which are considered to be risk factors by those with domain knowledge.

Week 7 – Summary

Literate statistical Programming

Literate programs is a single stream of human-readable text and machine-readable code.the code chunks are used to load and prepare the data,compute the results and create plots and tables. While the human readable text is used to describe the data, explain the analysis and present the result.

R markdown

R Markdown is a unified authoring framework that is reproducible which combines the code and the writer’s description. This document is designed to be used to communicate with stake holders, collaborate with other data scientists and as a notebook.

Shiny

Shiny is an open-source R package which combines the computational power of r with the interactivity of modern website building. Shiny provides a powerful web framework for building web applications using R. we can use shiny on a webpage or embed them in an R-markdown document.

Components of shiny

There are two components for shiny app, A user interface object and a server function. In a shiny app the use of the UI. R (User interface) is to control the layout and appearance of the app while the Server.R (server function) contains the instructions the users machine needs to build the app.