Read in your dataset and libraries

setwd("~/Data 101")
stroke_df <- read.csv("stroke.csv")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(ggplot2)
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)
library(dslabs)
## Warning: package 'dslabs' was built under R version 4.1.3
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.1.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'highcharter'
## The following object is masked from 'package:dslabs':
## 
##     stars
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.1.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.3
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Dataset summary

This dataset describes patients' internal and external characteristics together with their stroke status. It contains 5110 observations with 12 attributes. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides relevant information about one patient.

The variables in this dataset include:

id: unique identifier

gender: “Male”, “Female” or “Other”

age: age of the patient

hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension

heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease

ever_married: “No” or “Yes”

work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”

Residence_type: “Rural” or “Urban”

avg_glucose_level: average glucose level in blood

bmi: body mass index

smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*

stroke: 1 if the patient had a stroke or 0 if not

*Note: “Unknown” in smoking_status means that the information is unavailable for this patient.

a. the number of missing values in your dataset

b. the percentage of missing values in your dataset

sum(is.na(stroke_df))
## [1] 201
mean(is.na(stroke_df))
## [1] 0.003277886
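
The overall count does not show which columns the missing values sit in, and `mean(is.na(...))` returns a proportion, so it has to be multiplied by 100 to read as a percentage. Below is a minimal sketch of a per-column breakdown. It assumes a small made-up data frame, `toy_df`, which is hypothetical and stands in for the stroke data.

```r
# Hypothetical toy data standing in for stroke_df (values are made up)
toy_df <- data.frame(
  age = c(67, 61, 80, 49),
  bmi = c(36.6, NA, 32.5, NA),
  smoking_status = c("smokes", "", "never smoked", "Unknown"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing before counting
toy_df$smoking_status[toy_df$smoking_status == ""] <- NA

# Number of missing values in each column
na_count <- colSums(is.na(toy_df))

# Percentage of missing values in each column
na_pct <- 100 * colMeans(is.na(toy_df))

print(na_count)
print(na_pct)
```

Note that the literal string “Unknown” is not counted as missing here; whether it should be is a separate decision.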

Converting all empty strings to NA

stroke_df[stroke_df == ""]<-NA

Viewing the dataset

view(stroke_df)

Questions

Does one gender get strokes more than the other?

How does marriage affect stroke outcome?

Is there any correlation between smoking status & stroke?

Are there any unusual patterns?

Cleaning the data and converting all of the binary variables in our data into “Yes” and “No”

stroke1_df<- stroke_df%>% 
  mutate(hypertension=ifelse(hypertension==0,"No","Yes")) %>% 
  mutate(heart_disease=ifelse(heart_disease==0,"No","Yes")) %>% 
  mutate(stroke=ifelse(stroke==0,"No","Yes"))

Graph about the effect of gender and marriage on the outcome of stroke

Gender<-stroke1_df%>% 
  ggplot(aes(stroke))+
  geom_bar(aes(fill=gender),position = "dodge")+
  theme_minimal()+
  ylab("Count")+
  xlab("Stroke")+
  scale_fill_brewer(palette = "Set1")+
  ggtitle("Stroke by Gender")

Marriage<-stroke1_df %>% 
  ggplot(aes(stroke))+
  geom_bar(aes(fill=ever_married),position = "dodge")+
  theme_minimal()+
  ylab("Count")+
  xlab("Stroke")+scale_fill_brewer(palette = "Set2")+
  ggtitle("Stroke by Marriage Status")
ggarrange(Gender,Marriage)

Based on the graph above, we can observe that:

More women than men in the dataset had a stroke. However, since most of the data covers non-stroke individuals, I think it’s important to look at those counts too: there are also more women than men who didn’t have a stroke. Because the bars show counts rather than rates, the graph alone does not show that women were more likely to have a stroke.

Individuals who had a stroke are mostly female and mostly people who had been married.
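
Since the groups differ in size, comparing the share of each group that had a stroke can be more informative than comparing raw counts. This is a minimal sketch with made-up values; `toy_df` is a hypothetical stand-in and not the stroke data.

```r
# Hypothetical toy data: 8 women and 4 men (values are made up)
toy_df <- data.frame(
  gender = rep(c("Female", "Male"), times = c(8, 4)),
  stroke = c(rep("Yes", 3), rep("No", 5),   # women: 3 strokes out of 8
             rep("Yes", 2), rep("No", 2))   # men:   2 strokes out of 4
)

# Raw counts: rows are gender, columns are stroke status
counts <- table(toy_df$gender, toy_df$stroke)

# Row-wise proportions: share of each gender with and without a stroke
rates <- prop.table(counts, margin = 1)

print(counts)
print(rates)
```

In this toy example women have the higher stroke count but the lower stroke rate, which is the kind of pattern a count-based bar chart cannot reveal.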

A graph of smoking status and its effect on strokes

Smoke<-stroke1_df %>% 
  ggplot(aes(stroke))+
  geom_bar(aes(fill=smoking_status),position = "dodge")+
  theme_minimal()+
  ylab("Count")+
  xlab("Stroke")+scale_fill_brewer(palette = "Set2")+
  ggtitle("Stroke by Smoking Status")
ggarrange(Smoke)

Based on this graph, more people who never smoked had a stroke than people who formerly smoked or currently smoke. However, since this data is composed mostly of people who never smoked, it is important to look at the non-stroke counts as well: people who never smoked are also the largest group among those who did not have a stroke. People who formerly smoked were second among non-stroke individuals.

A stroke can happen in two ways: a blood clot or plaque blocks a blood vessel, or a blood vessel in the brain breaks or ruptures. Smoking is reported to double the risk of stroke. It increases blood pressure, which is a significant risk factor, and reduces oxygen in the blood. Tobacco smoke contains over 4000 chemicals, many of them toxic, which are deposited on the lungs or absorbed into the bloodstream, damaging blood vessels. Smoking also makes blood stickier, which can lead to blood clots and so increases the likelihood of stroke.

Graph depicting stroke based on work type and residence type

f<-stroke1_df %>% 
  ggplot(aes(stroke))+
  geom_bar(aes(fill=work_type),position = "dodge")+
  theme_minimal()+
  ylab("Count")+
  xlab("Stroke")+
  scale_fill_brewer(palette = "Set2")+
  ggtitle("Stroke by Work Type")

g<-stroke1_df %>% 
  ggplot(aes(stroke))+
  geom_bar(aes(fill=Residence_type),position = "dodge")+
  theme_minimal()+
  ylab("Count")+
  xlab("Stroke")+
  scale_fill_brewer(palette = "Set2")+
  ggtitle("Stroke by Residence Type")
ggarrange(f,g,ncol = 1)

According to the graph above, private-sector workers account for more strokes than self-employed and government workers. Stroke by residence type is almost evenly distributed. However, private workers are also the largest group among people who did not have a stroke, so the counts largely reflect how common private work is in the data. One possible reason for private workers avoiding strokes is the benefits of private-sector employment, which typically include larger salaries, more opportunities for advancement, and better benefits in the form of insurance coverage, vacation time, and annual bonuses. Work-life balance is very important, and work can be people’s main source of stress, which is often cited as a risk factor for stroke.

Filtering data for only people who had strokes

withstroke <- stroke_df %>% filter(stroke==1)

Stroke Incidence by Age

withstroke %>% ggplot(aes(age, fill=gender)) + geom_density(alpha=0.2) + ggtitle("Stroke by Age in Male and Female")

In conclusion, looking at the chart, age plays a vital role in the incidence of stroke. As the age of the participants increases, so does the prevalence of stroke. Women tend to be prone at an earlier age, while more men develop the illness over time.

withstroke %>% ggplot(aes(avg_glucose_level, fill=gender)) + geom_density(alpha=0.2) + ggtitle("Stroke and Glucose Level by Gender")

More of the participants who developed a stroke had lower glucose levels, and these were mostly women, while at the opposite end of the spectrum, those with elevated glucose levels who developed the illness were mostly men.

In conclusion, keeping one’s glucose level in the normal range may decrease the chance of developing the illness over time.

According to research, each year about 55,000 more American women than men have a stroke. In addition to the general risk factors for stroke, such as family history, smoking, high cholesterol, high blood pressure, being overweight, and lack of exercise, women face a set of unique risk factors. The first is a longer average life span: women typically live longer than men, and stroke risk is strongly associated with advancing age. The second is oral contraceptives: hormonal birth control pills can increase the risk of stroke, especially when coupled with other high-risk factors like smoking and diabetes. The last is migraines: the majority of migraine sufferers in the U.S. are women, and migraines with aura (visual disturbances) can increase a woman’s risk of stroke.

Two sample mean test

Replacing all the NA values in bmi with 0. Note that 0 is not a plausible bmi, so this replacement pulls the group means reported below downward.

stroke1_df$bmi[is.na(stroke1_df$bmi)] <- 0
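
Replacing missing values with 0 changes the averages that are computed afterwards. The sketch below uses made-up numbers (the `bmi` vector is hypothetical) to compare that approach with simply dropping the missing entries when summarising.

```r
# Hypothetical bmi values with two missing entries (made up for illustration)
bmi <- c(22.5, 31.0, NA, 27.5, NA, 29.0)

# Approach used in this report: replace NA with 0
bmi_zero <- bmi
bmi_zero[is.na(bmi_zero)] <- 0

# Alternative: keep NA in place and drop it when summarising
mean_zero <- mean(bmi_zero)           # zeros count as real observations
mean_drop <- mean(bmi, na.rm = TRUE)  # only observed values are averaged

print(mean_zero)
print(mean_drop)
```

The zero-filled mean is lower because every missing value is treated as a bmi of 0. `t.test()` removes NA values from each sample by itself, so leaving them in place is one possible alternative for the test in this section.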

Is it true that on average, bmi is lower in men than women?

H0: μm = μf The average bmi for males and females is equal

Hα: μm < μf The average bmi is lower in males than in females

Male <- subset(stroke1_df, gender == "Male")
Female <- subset(stroke1_df, gender== "Female")

t.test(Male$bmi, Female$bmi, alternative = "less", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  Male$bmi and Female$bmi
## t = -3.268, df = 4532.5, p-value = 0.0005455
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.4393942
## sample estimates:
## mean of x mean of y 
##  27.23924  28.12408

P-value: 0.0005455 < 0.05

A real-world conclusion for the results of the hypothesis test.

Conclusion: Reject the Null Hypothesis H0: μm = μf (the mean bmi of males equals the mean bmi of females)

We reject the null hypothesis and conclude that the mean bmi of men is lower than the mean bmi of women.

Create a visualization that shows the distribution of the two samples.

boxplot(bmi ~ gender, data = stroke1_df, outline = FALSE, col = "blue")

A real-world conclusion for the results of the confidence interval.

We are 95% confident that the mean bmi for men is at least 0.4393942 lower than the mean bmi for women.
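
To see where the t statistic, the degrees of freedom, and the one-sided confidence bound come from, the Welch test can be reproduced by hand. This is a minimal sketch on simulated samples: `x` and `y` are hypothetical and generated with an arbitrary seed, so they are not the bmi data.

```r
set.seed(101)                       # arbitrary seed, only for reproducibility
x <- rnorm(40, mean = 27, sd = 7)   # hypothetical "male" sample
y <- rnorm(50, mean = 28, sd = 8)   # hypothetical "female" sample

# Per-sample variance of the mean, and standard error of the difference
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(x) - mean(y)) / se
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided test (alternative = "less"): p-value and upper confidence bound
p_manual     <- pt(t_manual, df_manual)
upper_manual <- (mean(x) - mean(y)) + qt(0.95, df_manual) * se

# Built-in test for comparison
res <- t.test(x, y, alternative = "less", conf.level = 0.95)
print(c(t_manual, df_manual, p_manual, upper_manual))
print(res)
```

With `alternative = "less"` the interval runs from negative infinity to the upper bound, which is why only one finite number appears in the confidence interval of the output.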

Plots of the boxplots for all of the quantitative variables based on the stroke.

x<-stroke1_df %>% ggplot(aes(age))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Age")
y<-stroke1_df %>% ggplot(aes(avg_glucose_level))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Average Glucose Level")
z<-stroke1_df %>% ggplot(aes(bmi))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Body Mass Index")
ggarrange(x,y,z,common.legend = T)

From the boxplots above we can observe that people who had a stroke tend to be older and to have higher glucose levels than people who haven’t.

Convert all columns to numeric type and store them in a temporary data frame before passing it to the correlation function

df <- data.frame(data.matrix(stroke1_df))

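Check the data types of the columns. Note that the call below inspects the original stroke1_df, which is why character columns still appear; calling sapply(df, class) on the temporary data frame is what confirms that the conversion to numeric succeeded.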

sapply(stroke1_df, class)
##                id            gender               age      hypertension 
##         "integer"       "character"         "numeric"       "character" 
##     heart_disease      ever_married         work_type    Residence_type 
##       "character"       "character"       "character"       "character" 
## avg_glucose_level               bmi    smoking_status            stroke 
##         "numeric"         "numeric"       "character"       "character"
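
It helps to know what `data.matrix()` does to character columns: each one is first converted to a factor and then to its integer codes, with levels in alphabetical order. Below is a minimal sketch on a made-up data frame (`toy_df` is hypothetical, with three invented rows).

```r
# Hypothetical toy data with one numeric and two character columns
toy_df <- data.frame(
  age = c(67, 61, 80),
  ever_married = c("Yes", "No", "Yes"),
  work_type = c("Private", "Self-employed", "Govt_job"),
  stringsAsFactors = FALSE
)

# Same conversion as used in this report for the correlation matrix
num_df <- data.frame(data.matrix(toy_df))

# Every column is now numeric
col_classes <- sapply(num_df, class)
print(col_classes)

# Character values became integer codes based on alphabetical level order
print(num_df)
```

For two-level columns such as ever_married, the codes behave like a 0/1 indicator, so the correlation is interpretable. For columns with several unordered categories, such as work_type, the codes only reflect alphabetical order, so correlations involving them should be read with caution.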

Check the correlation of the dataset (the target output column is stroke)

res <- cor(df)
round(res, 2)
##                      id gender   age hypertension heart_disease ever_married
## id                 1.00   0.00  0.00         0.00          0.00         0.01
## gender             0.00   1.00 -0.03         0.02          0.09        -0.03
## age                0.00  -0.03  1.00         0.28          0.26         0.68
## hypertension       0.00   0.02  0.28         1.00          0.11         0.16
## heart_disease      0.00   0.09  0.26         0.11          1.00         0.11
## ever_married       0.01  -0.03  0.68         0.16          0.11         1.00
## work_type          0.01  -0.07  0.54         0.13          0.10         0.43
## Residence_type     0.00  -0.01  0.01        -0.01          0.00         0.01
## avg_glucose_level  0.00   0.06  0.24         0.17          0.16         0.16
## bmi                0.08  -0.05  0.22         0.07         -0.03         0.25
## smoking_status    -0.02   0.04 -0.38        -0.13         -0.06        -0.30
## stroke             0.01   0.01  0.25         0.13          0.13         0.11
##                   work_type Residence_type avg_glucose_level   bmi
## id                     0.01           0.00              0.00  0.08
## gender                -0.07          -0.01              0.06 -0.05
## age                    0.54           0.01              0.24  0.22
## hypertension           0.13          -0.01              0.17  0.07
## heart_disease          0.10           0.00              0.16 -0.03
## ever_married           0.43           0.01              0.16  0.25
## work_type              1.00           0.00              0.09  0.25
## Residence_type         0.00           1.00              0.00  0.00
## avg_glucose_level      0.09           0.00              1.00  0.08
## bmi                    0.25           0.00              0.08  1.00
## smoking_status        -0.34           0.00             -0.10 -0.19
## stroke                 0.08           0.02              0.13 -0.05
##                   smoking_status stroke
## id                         -0.02   0.01
## gender                      0.04   0.01
## age                        -0.38   0.25
## hypertension               -0.13   0.13
## heart_disease              -0.06   0.13
## ever_married               -0.30   0.11
## work_type                  -0.34   0.08
## Residence_type              0.00   0.02
## avg_glucose_level          -0.10   0.13
## bmi                        -0.19  -0.05
## smoking_status              1.00  -0.07
## stroke                     -0.07   1.00

Correlation of the data

corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

Based on the plot, we can conclude the following:

1. The “ever_married” column correlates positively with hypertension, avg glucose level, heart disease, stroke, bmi, work type, and age.

2. The “age” column correlates positively with every column except gender and smoking status, where the correlation is negative, and residence type and id, where it is close to zero.

3. The “stroke” column has a positive correlation with age, hypertension, heart disease, and average glucose level.

Give a fourth question about your dataset that you can investigate using a Chi-Square test.

Are hypertension and stroke related?

I assume that hypertension will be related to stroke.

Chi-Square Test

We can use a Chi-Square test with two categorical variables to determine if they are independent or related. The null hypothesis is always that the variables are independent (no relationship). If the null hypothesis is rejected in favor of the alternative hypothesis that the variables are related, the test still does not tell us what the relationship is.

heart <- stroke1_df %>% count(hypertension, stroke)
p <- ggplot(data = heart)
p + geom_tile(mapping = aes(x=hypertension, y=stroke, fill = n))

H0: hypertension and stroke are independent

Hα: hypertension and stroke are dependent

heart_table <- table(stroke1_df$hypertension, stroke1_df$stroke)
heart_table
##      
##         No  Yes
##   No  4429  183
##   Yes  432   66
result <- chisq.test(heart_table); result
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  heart_table
## X-squared = 81.605, df = 1, p-value < 2.2e-16

P-value:< 0.00000000000000022

Conclusion: Reject the Null Hypothesis H0: hypertension and stroke are independent. There is enough evidence to show that hypertension and stroke are related.
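
The test statistic can be reproduced by hand from the contingency table printed above. This is a minimal sketch: the counts are copied from that table, while the object names (`observed`, `expected`, `chi_sq`) are only illustrative.

```r
# Observed counts copied from the table above
# (rows: hypertension No/Yes, columns: stroke No/Yes; filled column by column)
observed <- matrix(c(4429, 432, 183, 66), nrow = 2,
                   dimnames = list(hypertension = c("No", "Yes"),
                                   stroke = c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each |observed - expected|
chi_sq  <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Built-in test for comparison (Yates' correction is the default for 2x2 tables)
res <- chisq.test(observed)
print(expected)
print(c(chi_sq, p_value))
```

The expected table shows the direction that the test itself does not report: far more people with both hypertension and a stroke were observed (66) than independence would predict.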

Some concerns I have about my dataset are ethical challenges with inappropriate data collection. One problem I ran into was sampling. Most of the individuals in the dataset didn’t have a stroke, which limited the findings and the flexibility I had with the dataset. To me, it was bad data collection. I think the population should have been split evenly between stroke and non-stroke individuals, or leaned more towards individuals who had a stroke, because the purpose of the data is to see the causes of stroke.

I am also concerned with data hygiene and data relevance. Looking at the data, I don’t think all of it is relevant, such as work type and residence type; I feel like they have no real correlation with stroke. Data hygiene also isn’t up to par: for example, a lot of individuals have an “Unknown” smoking status, which means this information wasn’t available. This concerns me about the data protocols and collection, because why wouldn’t someone want to let the collectors know whether they smoked or not? Were they not given a written statement for data privacy?

Concerning validation and testing of data models and analytics, data validation testing is a process that allows the user to check that the data they deal with is valid and complete. It is responsible for verifying that data and databases pass through any needed transformations without loss. Because there is unknown data in the dataset, those variables are left invalid with regard to the dataset as a whole.