Hi! My name is Caesar, and welcome to my Rmd :) In this LBB I use the healthcare-dataset-stroke-data.csv dataset from https://www.kaggle.com. I hope you enjoy it!
Make sure the data file is placed in the data_input folder inside the R project directory.
stroke <- read.csv("data_input/healthcare-dataset-stroke-data.csv")
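A small guard before reading can make a wrong path easier to diagnose. This is a minimal sketch, assuming the same relative path as above; `read_stroke` is a hypothetical helper name and not part of the original analysis.

```r
# Hypothetical helper: stop with a clear message if the CSV is not where we expect it.
read_stroke <- function(path = "data_input/healthcare-dataset-stroke-data.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  read.csv(path)
}
```

Calling `stroke <- read_stroke()` would then either load the data or report which working directory R searched.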
The data is loaded, so let’s get started.
head(stroke)
tail(stroke)
dim(stroke)
## [1] 5110 12
names(stroke)
## [1] "id" "gender" "age"
## [4] "hypertension" "heart_disease" "ever_married"
## [7] "work_type" "Residence_type" "avg_glucose_level"
## [10] "bmi" "smoking_status" "stroke"
From our inspection we can conclude :
* The stroke data contains 5110 rows and 12 columns
* The column names are: “id”, “gender”, “age”, “hypertension”, “heart_disease”, “ever_married”, “work_type”, “Residence_type”, “avg_glucose_level”, “bmi”, “smoking_status”, “stroke”
Check the data type of each column
str(stroke)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 11 32.5 34.4 24 29 27.4 22.8 24 24.2 ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
From this result, some columns do not have the correct data type. We need to convert them to the correct type (data coercion).
stroke$gender <- as.factor(stroke$gender)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$work_type <- as.factor(stroke$work_type)
stroke$Residence_type <- as.factor(stroke$Residence_type)
stroke$smoking_status <- as.factor(stroke$smoking_status)
str(stroke)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ work_type : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
## $ Residence_type : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 11 32.5 34.4 24 29 27.4 22.8 24 24.2 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
Each of these columns now has the desired data type.
Check for missing values
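The five conversions above can also be written in one step. A minimal sketch on a made-up data frame (`toy` and `cat_cols` are illustrative names, and the values are not from the stroke data):

```r
# Toy data frame standing in for `stroke` (values are made up for illustration).
toy <- data.frame(
  gender = c("Male", "Female", "Female"),
  ever_married = c("Yes", "No", "Yes"),
  age = c(67, 61, 80)
)

# Columns to treat as categorical.
cat_cols <- c("gender", "ever_married")

# lapply() applies as.factor() to each listed column; assigning back keeps the data frame shape.
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

str(toy)
```

For the stroke data, `cat_cols` would list the five character columns converted above.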
colSums(is.na(stroke))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
anyNA(stroke)
## [1] FALSE
Great! There are no missing values.
Now the stroke dataset is ready to be processed and analyzed.
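One caveat on the check above: `is.na()` only counts values that R already recognises as `NA`. If a file encodes missing values as text such as "N/A" (an assumption here; the output above does not show how this file marks them), they can be mapped to `NA` at read time with the `na.strings` argument of `read.csv()`. A small sketch on made-up inline data:

```r
# Made-up CSV text; "N/A" marks a missing bmi value in this toy example.
csv_text <- "id,bmi\n1,36.6\n2,N/A\n3,32.5"

# Without na.strings, "N/A" is kept as text, so bmi is not numeric and is.na() finds nothing.
raw <- read.csv(text = csv_text)
sum(is.na(raw$bmi))

# With na.strings, "N/A" becomes NA and bmi is read as numeric.
clean <- read.csv(text = csv_text, na.strings = c("N/A", ""))
sum(is.na(clean$bmi))
```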
Brief explanation
summary(stroke)
## id gender age hypertension
## Min. : 67 Female:2994 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Male :2116 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. :10.30 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.25 1st Qu.:23.50 never smoked :1892 1st Qu.:0.00000
## Median : 91.89 Median :28.00 smokes : 789 Median :0.00000
## Mean :106.15 Mean :29.08 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:33.20 3rd Qu.:0.00000
## Max. :271.74 Max. :97.60 Max. :1.00000
Summary :
1. The data contains 2116 males and 2994 females
2. The mean age is 43.23 years; the youngest is 0.08 years and the oldest is 82 years
3. Of the 5110 records, 3353 people have ever been married and 1757 have not
4. Private is the most common work type, with 2925 records
5. 2514 people live in rural areas and 2596 people live in urban areas
6. Average glucose level: 106.15, min: 55.12, max: 271.74
7. Average BMI: 29.08, min: 10.30, max: 97.60
8. Of the 5110 records, 789 people currently smoke, 885 formerly smoked, 1892 never smoked, and 1544 have an unknown smoking status
Check for outliers in bmi
aggregate(bmi ~ work_type, stroke, mean)
aggregate(bmi ~ work_type, stroke, var)
aggregate(bmi ~ work_type, stroke, sd)
boxplot(stroke$bmi)
From the boxplot above, there are possible outliers. However, from our calculation the standard deviation is around 7.0, which in my opinion is still tolerable, so the analysis continues with the data as is.
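The points that `boxplot()` draws beyond the whiskers can also be listed explicitly. A sketch using `boxplot.stats()` from base R on made-up numbers (not the stroke data); it applies the default 1.5 × IQR whisker rule.

```r
# Made-up BMI-like values; 97.6 is placed far from the rest on purpose.
bmi_toy <- c(22.8, 24.0, 24.2, 27.4, 29.0, 32.5, 34.4, 36.6, 97.6)

# boxplot.stats() returns the values lying beyond the whiskers in $out.
out_vals <- boxplot.stats(bmi_toy)$out
out_vals

# Share of observations flagged as outliers.
length(out_vals) / length(bmi_toy)
```

On the real data, `boxplot.stats(stroke$bmi)$out` would give the same kind of list, which helps decide whether the flagged values are plausible measurements or entry errors.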
Correlation between avg_glucose_level and bmi
cor(stroke$avg_glucose_level, stroke$bmi)
## [1] 0.1618624
plot(stroke$avg_glucose_level, stroke$bmi)
abline(lm(stroke$bmi ~ stroke$avg_glucose_level), col = "red")
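To judge whether a correlation of this size is distinguishable from zero, `cor.test()` reports a p-value and a confidence interval alongside the coefficient. A sketch on simulated data (the variable names and numbers below are made up, not taken from the stroke data):

```r
# Simulated values with a weak positive relationship.
set.seed(1)
glucose_toy <- rnorm(200, mean = 106, sd = 45)
bmi_sim <- 29 + 0.03 * glucose_toy + rnorm(200, sd = 7)

# Pearson correlation test (the default method of cor.test()).
res <- cor.test(glucose_toy, bmi_sim)

res$estimate   # correlation coefficient, same value as cor()
res$p.value    # p-value for the null hypothesis of zero correlation
res$conf.int   # 95% confidence interval for the coefficient
```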
stroke_stroke <- stroke[stroke$stroke == 1, ]
round(prop.table(table(stroke_stroke$gender))*100,2)
##
## Female Male
## 56.63 43.37
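Answer: Among stroke patients, women (Female) make up the larger share at 56.63%. Women are also 58.59% of the whole dataset (2994 of 5110), so this share alone does not show that women are at higher risk.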
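The table above gives the gender split among stroke patients. A complementary view is the stroke rate within each gender, which accounts for group sizes. A sketch on a made-up data frame (`toy` and its values are illustrative only):

```r
# Made-up records: four women and two men.
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Female", "Male", "Male"),
  stroke = c(1, 1, 0, 0, 1, 0)
)

# Share of each gender among stroke cases (the same kind of table as above).
share_among_cases <- prop.table(table(toy$gender[toy$stroke == 1]))

# Stroke rate within each gender: the mean of a 0/1 column is the proportion of 1s.
rate_within_gender <- tapply(toy$stroke, toy$gender, mean)

round(share_among_cases * 100, 2)
round(rate_within_gender * 100, 2)
```

In this toy data women account for two thirds of the cases, yet the stroke rate is 50% in both groups, which is why the two views can lead to different conclusions.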
stroke_stroke <- stroke[stroke$stroke == 1, ]
round(prop.table(table(stroke_stroke$smoking_status))*100,2)
##
## formerly smoked never smoked smokes Unknown
## 28.11 36.14 16.87 18.88
Answer: Among stroke patients, the never smoked group is the largest at 36.14%.
stroke_age_stroke <- stroke[stroke$age >= 20 & stroke$stroke == 1, ]
round(prop.table(table(stroke_age_stroke$work_type))*100,2)
##
## children Govt_job Never_worked Private Self-employed
## 0.00 13.36 0.00 60.32 26.32
Answer: Among stroke patients aged 20 and over, the Private work type is the largest group at 60.32%.
xtabs(stroke ~ work_type + smoking_status, stroke)
## smoking_status
## work_type formerly smoked never smoked smokes Unknown
## children 0 0 0 2
## Govt_job 8 12 5 8
## Never_worked 0 0 0 0
## Private 43 48 29 29
## Self-employed 19 30 8 8
plot(xtabs(stroke ~ work_type + smoking_status, stroke))
Answer: Based on the result above, the combination of work_type Private and smoking_status never smoked has the highest number of strokes (48).
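Counts like these depend on how many people fall into each cell, so it can help to divide by the cell sizes. A sketch on made-up records (`toy`, `cases`, `n` and `rate` are illustrative names), using `xtabs()` once for the case counts and once, with an empty left-hand side, for the number of rows per cell:

```r
# Made-up records for two work types and two smoking statuses.
toy <- data.frame(
  work_type = c("Private", "Private", "Private", "Private", "Govt_job", "Govt_job"),
  smoking_status = c("never smoked", "never smoked", "smokes", "smokes", "never smoked", "smokes"),
  stroke = c(1, 0, 1, 0, 1, 0)
)

# Number of stroke cases per cell (sum of the 0/1 column).
cases <- xtabs(stroke ~ work_type + smoking_status, toy)

# Number of people per cell (no left-hand side means plain counts).
n <- xtabs(~ work_type + smoking_status, toy)

# Stroke rate per cell; cells with n = 0 would give NaN.
rate <- cases / n
round(rate * 100, 2)
```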
xtabs(hypertension ~ work_type + smoking_status, stroke)
## smoking_status
## work_type formerly smoked never smoked smokes Unknown
## children 0 0 0 0
## Govt_job 20 34 16 3
## Never_worked 0 0 0 0
## Private 63 130 59 29
## Self-employed 37 68 19 20
plot(xtabs(hypertension ~ work_type + smoking_status, stroke))
Answer: Based on the result above, the combination of work_type Private and smoking_status never smoked has the highest number of hypertension cases (130).
xtabs(heart_disease ~ work_type + smoking_status, stroke)
## smoking_status
## work_type formerly smoked never smoked smokes Unknown
## children 0 0 0 1
## Govt_job 7 16 7 6
## Never_worked 0 0 0 0
## Private 45 56 36 21
## Self-employed 25 18 18 20
plot(xtabs(heart_disease ~ work_type + smoking_status, stroke))
Answer: Based on the result above, the combination of work_type Private and smoking_status never smoked has the highest number of heart_disease cases (56).
stroke_stroke <- stroke[stroke$stroke == 1, ]
mean(stroke_stroke$age)
## [1] 67.72819
Answer: Based on the result above, the mean age of people who had a stroke is about 67.7 years.
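The mean age of stroke patients is easier to interpret next to the mean age of people without a stroke. A sketch with `aggregate()`, which is already used earlier in this document, on made-up values (`toy` and `age_by_stroke` are illustrative names):

```r
# Made-up ages for three people with a stroke (1) and three without (0).
toy <- data.frame(
  age = c(67, 80, 49, 30, 25, 58),
  stroke = c(1, 1, 1, 0, 0, 0)
)

# Mean age for each value of stroke, returned as a small data frame.
age_by_stroke <- aggregate(age ~ stroke, data = toy, FUN = mean)
age_by_stroke
```

On the real data the equivalent call would be `aggregate(age ~ stroke, data = stroke, FUN = mean)`.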
1. Women make up the larger share of stroke patients (56.63%), though they are also the majority of the dataset
2. Glucose level has a weak positive correlation with bmi (r ≈ 0.16): higher glucose values tend to go with slightly higher bmi
3. Among stroke patients, the never smoked group is the largest
4. work_type Private with smoking_status never smoked has the highest number of strokes
5. work_type Private with smoking_status never smoked has the highest number of hypertension cases
6. work_type Private with smoking_status never smoked has the highest number of heart_disease cases
7. The mean age of people who had a stroke is about 67.7 years.
Stay Healthy & Stay humble :)