1 Explanation

Hi! My name is Caesar. Welcome to my Rmd :) In this LBB I will use the healthcare-dataset-stroke-data.csv dataset from https://www.kaggle.com. I hope you enjoy it!

2 Input Data

Make sure the data file is placed in the data_input folder inside our R project folder.

stroke <- read.csv("data_input/healthcare-dataset-stroke-data.csv")

The data has been loaded, so let’s get started.

2.1 Data Inspection

head(stroke)

tail(stroke)

dim(stroke)

## [1] 5110   12

names(stroke)

##  [1] "id"                "gender"            "age"              
##  [4] "hypertension"      "heart_disease"     "ever_married"     
##  [7] "work_type"         "Residence_type"    "avg_glucose_level"
## [10] "bmi"               "smoking_status"    "stroke"

From our inspection we can conclude:
* The stroke data contains 5110 rows and 12 columns.
* The column names are: “id”, “gender”, “age”, “hypertension”, “heart_disease”, “ever_married”, “work_type”, “Residence_type”, “avg_glucose_level”, “bmi”, “smoking_status”, “stroke”.

2.2 Data Cleansing & Coercion

Check data type for each column

str(stroke)

## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 11 32.5 34.4 24 29 27.4 22.8 24 24.2 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

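From this result, we find that some columns are not stored in the correct data type. The character columns gender, ever_married, work_type, Residence_type, and smoking_status are categorical, so we convert them to factors (data coercion).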

stroke$gender <- as.factor(stroke$gender)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$work_type <- as.factor(stroke$work_type)
stroke$Residence_type <- as.factor(stroke$Residence_type)
stroke$smoking_status <- as.factor(stroke$smoking_status)


str(stroke)

## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ work_type        : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
##  $ Residence_type   : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 11 32.5 34.4 24 29 27.4 22.8 24 24.2 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

Each of these columns has now been converted to the desired data type.
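
The five conversions above can also be written as one step. The following is only a minimal sketch on a made-up toy data frame (`toy` and `cat_cols` are illustrative names, not part of the original analysis); on the real data, `cat_cols` would list all five categorical columns.

```r
# Toy data frame standing in for `stroke` (values are made up for illustration)
toy <- data.frame(
  gender = c("Male", "Female", "Female"),
  ever_married = c("Yes", "No", "Yes"),
  age = c(67, 61, 80),
  stringsAsFactors = FALSE
)

# Names of the categorical columns to convert (hypothetical helper vector)
cat_cols <- c("gender", "ever_married")

# Convert every listed column to a factor in one step
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Inspect the result: the listed columns are factors, age stays numeric
str(toy)
```

Keeping the column names in one vector means a newly discovered categorical column only has to be added in a single place.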

Check for missing values

colSums(is.na(stroke))

##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0

anyNA(stroke)

## [1] FALSE

Great! There are no missing values.

Now the stroke dataset is ready to be processed and analyzed.
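
One caveat: `is.na()` only detects real `NA` values, so a text label such as "Unknown" in smoking_status is not counted as missing. A minimal sketch of how such labels could be counted, using a made-up toy data frame (the `placeholders` vector is an assumption and should be adapted to the labels actually present in the data):

```r
# Toy data frame standing in for `stroke` (values are made up for illustration)
toy <- data.frame(
  work_type = c("Private", "Govt_job", "Private", "children"),
  smoking_status = c("formerly smoked", "never smoked", "Unknown", "Unknown"),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because "Unknown" is an ordinary string
na_counts <- colSums(is.na(toy))

# Count text placeholders per column instead (assumed placeholder labels)
placeholders <- c("Unknown")
placeholder_counts <- sapply(toy, function(x) sum(as.character(x) %in% placeholders))

na_counts
placeholder_counts
```

Whether "Unknown" should be treated as missing or kept as its own category is a judgment call that depends on the analysis.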

3 Data Explanation

Brief explanation

summary(stroke)

##        id           gender          age         hypertension    
##  Min.   :   67   Female:2994   Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Male  :2116   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932                 Median :45.00   Median :0.00000  
##  Mean   :36518                 Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                 3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                 Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.12    Min.   :10.30   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.50   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.89    Median :28.00   smokes         : 789   Median :0.00000  
##  Mean   :106.15    Mean   :29.08   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:33.20                          3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                          Max.   :1.00000

Summary:
1. The data contains 2116 males and 2994 females.
2. The average age is 43.23 years; the youngest is 0.08 years and the oldest is 82 years.
3. Of the 5110 records, 3353 people have been married and 1757 have not.
4. Private is the most common work type, with 2925 records.
5. 2514 people live in rural areas and 2596 live in urban areas.
6. Average glucose level: 106.15 (min 55.12, max 271.74).
7. Average BMI: 29.08 (min 10.30, max 97.60).
8. Of the 5110 records, 789 people smoke, 885 formerly smoked, 1892 never smoked, and 1544 have an unknown smoking status.

Check for outliers in bmi

aggregate(bmi ~ work_type, stroke, mean)

aggregate(bmi ~ work_type, stroke, var)

aggregate(bmi ~ work_type, stroke, sd)

boxplot(stroke$bmi)

From the results above, we find possible outliers, but by our calculation the SD is around 7.0 (in my opinion this can still be tolerated), so the process may continue.
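
To list the suspicious values instead of only seeing them in the plot, one option is the 1.5 × IQR rule of thumb that boxplot whiskers are based on. This is a minimal sketch on a small made-up bmi vector, not the result for the real data:

```r
# Made-up bmi values for illustration (one deliberately extreme value)
bmi <- c(22.8, 24.0, 24.2, 27.4, 29.0, 32.5, 34.4, 36.6, 97.6)

# First and third quartiles and the interquartile range
q <- quantile(bmi, c(0.25, 0.75))
iqr <- IQR(bmi)

# Fences: values outside [lower, upper] are flagged as potential outliers
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

outliers <- bmi[bmi < lower | bmi > upper]
outliers
```

Flagged values are only candidates; whether to keep or remove them remains a judgment call, as in the text above.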

Correlation between avg_glucose_level and bmi

cor(stroke$avg_glucose_level, stroke$bmi)

## [1] 0.1618624

plot(stroke$avg_glucose_level, stroke$bmi)
abline(lm(stroke$bmi ~ stroke$avg_glucose_level), col = "red")
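
`cor()` returns only the coefficient. As a possible follow-up, `cor.test()` also reports a p-value and a confidence interval, which helps when judging how strong a relationship really is. A minimal sketch with made-up values (the vectors below are not taken from the dataset):

```r
# Made-up glucose and bmi values for illustration
glucose <- c(229, 202, 106, 171, 174, 186, 70, 94, 76, 58)
bmi <- c(36.6, 31.0, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.0, 24.2)

# Pearson correlation test (the default method of cor.test)
res <- cor.test(glucose, bmi)

res$estimate  # correlation coefficient, same value as cor(glucose, bmi)
res$p.value   # p-value for the null hypothesis of zero correlation
res$conf.int  # 95% confidence interval for the coefficient
```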

4 Data Manipulation & Transformation

Which gender has the most strokes?

stroke_stroke <- stroke[stroke$stroke == 1, ]
round(prop.table(table(stroke_stroke$gender))*100,2)

## 
## Female   Male 
##  56.63  43.37

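Answer: Females account for more of the stroke cases (56.63%) than males (43.37%). Females also make up the majority of the dataset (2994 of 5110 records), so this share alone does not show that women are at higher risk.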
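
The share of stroke cases per gender and the stroke rate within each gender are different quantities. The sketch below uses a tiny made-up data frame, chosen so that the two disagree, to show how both could be computed; it says nothing about the real data.

```r
# Toy data: 6 females (2 strokes) and 2 males (1 stroke), made up for illustration
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Female", "Female", "Female",
             "Male", "Male"),
  stroke = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Share of stroke cases by gender (the quantity computed in the analysis above)
share <- prop.table(table(toy$gender[toy$stroke == 1]))

# Stroke rate within each gender: mean of a 0/1 column is the proportion of 1s
rate <- tapply(toy$stroke, toy$gender, mean)

round(share * 100, 2)
round(rate * 100, 2)
```

In this toy example females make up most of the stroke cases while males have the higher rate, which is why the group sizes matter when interpreting the share.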

Which smoking status has the most strokes?

stroke_stroke <- stroke[stroke$stroke == 1, ]
round(prop.table(table(stroke_stroke$smoking_status))*100,2)

## 
## formerly smoked    never smoked          smokes         Unknown 
##           28.11           36.14           16.87           18.88

Answer: People who never smoked account for the largest share of stroke cases (36.14%).

Which work type has the most strokes among people aged 20 and over?

stroke_age_stroke <- stroke[stroke$age >= 20 & stroke$stroke == 1, ]
round(prop.table(table(stroke_age_stroke$work_type))*100,2)

## 
##      children      Govt_job  Never_worked       Private Self-employed 
##          0.00         13.36          0.00         60.32         26.32

Answer: Among stroke cases aged 20 and over, Private is the most common work type (60.32%).

How many stroke cases are there for each combination of work_type and smoking_status, and which combination is the highest?

xtabs(stroke ~ work_type + smoking_status, stroke)

##                smoking_status
## work_type       formerly smoked never smoked smokes Unknown
##   children                    0            0      0       2
##   Govt_job                    8           12      5       8
##   Never_worked                0            0      0       0
##   Private                    43           48     29      29
##   Self-employed              19           30      8       8

plot(xtabs(stroke ~ work_type + smoking_status, stroke))

Answer: Based on the result above, work_type Private with smoking_status never smoked has the highest number of stroke cases, with a total of 48.
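
Instead of reading the largest cell off the table by eye, it can be located programmatically. A minimal sketch on a made-up toy data frame (`top_work` and `top_smoke` are illustrative names):

```r
# Toy data standing in for `stroke` (values are made up for illustration)
toy <- data.frame(
  work_type = c("Private", "Private", "Private", "Govt_job", "Govt_job",
                "Self-employed"),
  smoking_status = c("never smoked", "never smoked", "smokes", "never smoked",
                     "smokes", "smokes"),
  stroke = c(1, 1, 1, 1, 0, 0)
)

# Sum of the 0/1 stroke column for each work_type x smoking_status combination
tab <- xtabs(stroke ~ work_type + smoking_status, toy)

# Row and column position of the largest cell (first one if there are ties)
idx <- which(tab == max(tab), arr.ind = TRUE)
top_work <- rownames(tab)[idx[1, 1]]
top_smoke <- colnames(tab)[idx[1, 2]]

cat(top_work, "/", top_smoke, ":", max(tab), "\n")
```

The same pattern should work unchanged for the hypertension and heart_disease tables.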

How many hypertension cases are there for each combination of work_type and smoking_status, and which combination is the highest?

xtabs(hypertension ~ work_type + smoking_status, stroke)

##                smoking_status
## work_type       formerly smoked never smoked smokes Unknown
##   children                    0            0      0       0
##   Govt_job                   20           34     16       3
##   Never_worked                0            0      0       0
##   Private                    63          130     59      29
##   Self-employed              37           68     19      20

plot(xtabs(hypertension ~ work_type + smoking_status, stroke))

Answer: Based on the result above, work_type Private with smoking_status never smoked has the highest number of hypertension cases, with a total of 130.

How many heart_disease cases are there for each combination of work_type and smoking_status, and which combination is the highest?

xtabs(heart_disease ~ work_type + smoking_status, stroke)

##                smoking_status
## work_type       formerly smoked never smoked smokes Unknown
##   children                    0            0      0       1
##   Govt_job                    7           16      7       6
##   Never_worked                0            0      0       0
##   Private                    45           56     36      21
##   Self-employed              25           18     18      20

plot(xtabs(heart_disease ~ work_type + smoking_status, stroke))

Answer: Based on the result above, work_type Private with smoking_status never smoked has the highest number of heart_disease cases, with a total of 56.

What is the average age of stroke sufferers?

stroke_stroke <- stroke[stroke$stroke == 1, ]
mean(stroke_stroke$age)

## [1] 67.72819

Answer: Based on the result above, the average age of people who had a stroke is about 67.7 years.
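
For context, the same average can be computed for both groups at once, which makes it easier to compare stroke and non-stroke records. A minimal sketch on made-up toy data (not the real result):

```r
# Toy data standing in for `stroke` (values are made up for illustration)
toy <- data.frame(
  age = c(67, 80, 49, 30, 25, 59),
  stroke = c(1, 1, 1, 0, 0, 0)
)

# Mean age for each value of the stroke column (0 = no stroke, 1 = stroke)
age_by_stroke <- aggregate(age ~ stroke, toy, mean)
age_by_stroke
```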

5 Explanatory Text

1. Women account for the majority of stroke cases (56.63%).
2. avg_glucose_level has a weak positive correlation with bmi (r ≈ 0.16): higher glucose values tend to go with slightly higher bmi.
3. People who never smoked account for the largest share of stroke cases (36.14%).
4. work_type Private with smoking_status never smoked has the highest number of stroke cases (48).
5. work_type Private with smoking_status never smoked has the highest number of hypertension cases (130).
6. work_type Private with smoking_status never smoked has the highest number of heart_disease cases (56).
7. On average, people who had a stroke are about 67.7 years old.

Stay Healthy & Stay humble :)

Exploratory Data Analysis - Stroke Dataset

Ibnu Caesar

3/20/2021