setwd("C:/Users/user/Desktop")
heart<-read.csv(file="heart_2020_cleaned.csv",header=TRUE)
str(heart) #structure of heart
## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : chr "No" "No" "No" "No" ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr "No" "No" "No" "No" ...
## $ Stroke : chr "No" "Yes" "No" "No" ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr "No" "No" "No" "No" ...
## $ Sex : chr "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr "White" "White" "White" "White" ...
## $ Diabetic : chr "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr "No" "No" "No" "No" ...
## $ SkinCancer : chr "Yes" "No" "No" "Yes" ...
dim(heart) #rows and columns
## [1] 319795 18
summary(heart) #summary of heart
## HeartDisease BMI Smoking AlcoholDrinking
## Length:319795 Min. :12.02 Length:319795 Length:319795
## Class :character 1st Qu.:24.03 Class :character Class :character
## Mode :character Median :27.34 Mode :character Mode :character
## Mean :28.33
## 3rd Qu.:31.42
## Max. :94.85
## Stroke PhysicalHealth MentalHealth DiffWalking
## Length:319795 Min. : 0.000 Min. : 0.000 Length:319795
## Class :character 1st Qu.: 0.000 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Median : 0.000 Mode :character
## Mean : 3.372 Mean : 3.898
## 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :30.000 Max. :30.000
## Sex AgeCategory Race Diabetic
## Length:319795 Length:319795 Length:319795 Length:319795
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PhysicalActivity GenHealth SleepTime Asthma
## Length:319795 Length:319795 Min. : 1.000 Length:319795
## Class :character Class :character 1st Qu.: 6.000 Class :character
## Mode :character Mode :character Median : 7.000 Mode :character
## Mean : 7.097
## 3rd Qu.: 8.000
## Max. :24.000
## KidneyDisease SkinCancer
## Length:319795 Length:319795
## Class :character Class :character
## Mode :character Mode :character
##
##
##
sum(is.na(heart)) #to check if there is NA in data
## [1] 0
The data is from Kaggle. URL: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download
The raw data contains 319795 rows and 18 columns.
There are no missing values (NA) in the data.
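sum(is.na(heart)) returns only one total for the whole data frame. A per-column count, together with a check for empty strings in the character columns (which is.na() does not flag), could be sketched as follows. The data frame `toy` is made up for illustration and is not the Kaggle data.

```r
# Hypothetical toy data standing in for `heart`.
toy <- data.frame(
  HeartDisease = c("No", "Yes", "", "No"),
  BMI          = c(16.6, NA, 26.6, 24.2),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))

# Number of empty strings in each column (0 for numeric columns).
empty_per_column <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))

print(na_per_column)
print(empty_per_column)
```

Applied to the real data, the same two lines would use `heart` in place of `toy`.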
The variables are HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, and SkinCancer.
The variables are defined as follows.
HeartDisease: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).
BMI: Body Mass Index (BMI).
Smoking: Have you smoked at least 100 cigarettes in your entire life?
AlcoholDrinking: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week).
Stroke: (Ever told) (you had) a stroke?
PhysicalHealth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
MentalHealth: Thinking about your mental health, for how many days during the past 30 days was your mental health not good?
DiffWalking: Do you have serious difficulty walking or climbing stairs?
Sex: Are you male or female?
AgeCategory: Fourteen-level age category.
Race: Imputed race/ethnicity value.
Diabetic: (Ever told) (you had) diabetes?
PhysicalActivity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.
GenHealth: Would you say that in general your health is.
SleepTime: On average, how many hours of sleep do you get in a 24-hour period?
Asthma: (Ever told) (you had) asthma?
KidneyDisease: Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
SkinCancer: (Ever told) (you had) skin cancer?
I want to fit a linear model to predict whether or not a person has heart disease.
# Recode the Yes/No columns used later as 1/0 (Yes = 1, No = 0).
for (v in c("HeartDisease", "Smoking", "Asthma", "DiffWalking", "Stroke")) {
  heart[[v]] <- ifelse(heart[[v]] == "Yes", 1, 0)
}
# Recode Sex as 1/0 (Female = 1, Male = 0).
heart$Sex <- ifelse(heart$Sex == "Female", 1, 0)
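# Drop AlcoholDrinking, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, KidneyDisease and SkinCancer.
heart_new<-heart[,-c(4,10,11,12,13,14,17,18)]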
View(heart_new)
I use only some of the variables for prediction, so I removed the others.
I recoded HeartDisease as 1 for respondents who have heart disease and 0 for those who do not.
I recoded Smoking as 1 for respondents who have smoked and 0 for those who have not.
I recoded Sex as 1 for females and 0 for males.
I recoded Asthma as 1 for respondents who have asthma and 0 for those who do not.
I recoded DiffWalking as 1 for respondents who have difficulty walking and 0 for those who do not.
I recoded Stroke as 1 for respondents who have had a stroke and 0 for those who have not.
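A recoding like this can be checked by cross-tabulating the original labels against the new 0/1 values, and the columns to keep can be chosen by name rather than by position. The following is a minimal sketch on made-up data; `toy`, `original`, `check` and `keep` are illustrative names that do not appear in the analysis above.

```r
# Hypothetical toy data with the same kind of columns as `heart`.
toy <- data.frame(
  Stroke = c("No", "Yes", "No", "No"),
  Race   = c("White", "Black", "Asian", "White"),
  BMI    = c(16.6, 20.3, 26.6, 24.2),
  stringsAsFactors = FALSE
)

# Keep the original labels so the recoding can be verified.
original <- toy$Stroke
toy$Stroke <- as.numeric(toy$Stroke == "Yes")  # Yes = 1, No = 0

# Every "Yes" should land in column 1 and every "No" in column 0.
check <- table(original, recoded = toy$Stroke)
print(check)

# Select the modelling columns by name instead of by index.
keep <- c("Stroke", "BMI")
toy_new <- toy[, keep]
str(toy_new)
```

Selecting by name keeps working if the column order of the CSV file changes, whereas negative indices silently drop different columns.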
heart_new$HeartDisease<-as.numeric(heart_new$HeartDisease)
heart_new$Smoking<-as.numeric(heart_new$Smoking)
heart_new$Stroke<-as.numeric(heart_new$Stroke)
heart_new$DiffWalking<-as.numeric(heart_new$DiffWalking)
heart_new$Sex<-as.numeric(heart_new$Sex)
heart_new$Asthma<-as.numeric(heart_new$Asthma)
par(mfrow=c(2,3)) # 2 x 3 grid; hist() is base graphics and respects par(mfrow), lattice's histogram() does not
hist(heart_new$HeartDisease,col="goldenrod2",xlab="HeartDisease",main="HeartDisease")
hist(heart_new$Smoking,col="darkblue",xlab="Smoking",main="Smoking")
hist(heart_new$Stroke,col="darkgreen",xlab="Stroke",main="Stroke")
hist(heart_new$DiffWalking,col="red",xlab="DiffWalking",main="DiffWalking")
hist(heart_new$Sex,col="black",xlab="Sex",main="Sex")
hist(heart_new$Asthma,col="lightblue",xlab="Asthma",main="Asthma")
par(mfrow=c(1,4))
boxplot(heart_new$BMI)
mean(heart_new$BMI)
## [1] 28.3254
boxplot(heart_new$PhysicalHealth)
boxplot(heart_new$MentalHealth)
boxplot(heart_new$SleepTime)
plot(density(heart_new$BMI));plot(density(heart_new$PhysicalHealth))
plot(density(heart_new$MentalHealth));plot(density(heart_new$SleepTime))
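Interestingly, the respondents are on average somewhat heavier than the normal range: their mean BMI is 28.33, while the normal range is about 18-24.
For Smoking and Sex the two groups (0 and 1) are of similar size, but for the other binary variables (HeartDisease, Stroke, DiffWalking and Asthma) the two groups differ greatly in size.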
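The balance of each 0/1 variable can also be read off as a proportion instead of from a histogram, because the mean of a 0/1 variable equals the share of ones. A short sketch with made-up values (the data frame `toy` is hypothetical):

```r
# Hypothetical 0/1 data: one rare outcome and one balanced grouping.
toy <- data.frame(
  HeartDisease = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  Sex          = c(1, 0, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Share of ones in each column.
share_of_ones <- colMeans(toy)
print(share_of_ones)

# The same information for one variable as a table of proportions.
print(prop.table(table(toy$HeartDisease)))
```

On the real data, `colMeans()` applied to the binary columns of `heart_new` would give the corresponding shares.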
library(Hmisc) # provides rcorr()
rcorr(as.matrix(heart_new), type="pearson") #the correlation matrix of all variables
## HeartDisease BMI Smoking Stroke PhysicalHealth MentalHealth
## HeartDisease 1.00 0.05 0.11 0.20 0.17 0.03
## BMI 0.05 1.00 0.02 0.02 0.11 0.06
## Smoking 0.11 0.02 1.00 0.06 0.12 0.09
## Stroke 0.20 0.02 0.06 1.00 0.14 0.05
## PhysicalHealth 0.17 0.11 0.12 0.14 1.00 0.29
## MentalHealth 0.03 0.06 0.09 0.05 0.29 1.00
## DiffWalking 0.20 0.18 0.12 0.17 0.43 0.15
## Sex -0.07 -0.03 -0.09 0.00 0.04 0.10
## SleepTime 0.01 -0.05 -0.03 0.01 -0.06 -0.12
## Asthma 0.04 0.09 0.02 0.04 0.12 0.11
## DiffWalking Sex SleepTime Asthma
## HeartDisease 0.20 -0.07 0.01 0.04
## BMI 0.18 -0.03 -0.05 0.09
## Smoking 0.12 -0.09 -0.03 0.02
## Stroke 0.17 0.00 0.01 0.04
## PhysicalHealth 0.43 0.04 -0.06 0.12
## MentalHealth 0.15 0.10 -0.12 0.11
## DiffWalking 1.00 0.07 -0.02 0.10
## Sex 0.07 1.00 0.02 0.07
## SleepTime -0.02 0.02 1.00 -0.05
## Asthma 0.10 0.07 -0.05 1.00
##
## n= 319795
##
##
## P
## HeartDisease BMI Smoking Stroke PhysicalHealth MentalHealth
## HeartDisease 0.0000 0.0000 0.0000 0.0000 0.0000
## BMI 0.0000 0.0000 0.0000 0.0000 0.0000
## Smoking 0.0000 0.0000 0.0000 0.0000 0.0000
## Stroke 0.0000 0.0000 0.0000 0.0000 0.0000
## PhysicalHealth 0.0000 0.0000 0.0000 0.0000 0.0000
## MentalHealth 0.0000 0.0000 0.0000 0.0000 0.0000
## DiffWalking 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Sex 0.0000 0.0000 0.0000 0.0805 0.0000 0.0000
## SleepTime 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Asthma 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## DiffWalking Sex SleepTime Asthma
## HeartDisease 0.0000 0.0000 0.0000 0.0000
## BMI 0.0000 0.0000 0.0000 0.0000
## Smoking 0.0000 0.0000 0.0000 0.0000
## Stroke 0.0000 0.0805 0.0000 0.0000
## PhysicalHealth 0.0000 0.0000 0.0000 0.0000
## MentalHealth 0.0000 0.0000 0.0000 0.0000
## DiffWalking 0.0000 0.0000 0.0000
## Sex 0.0000 0.0000 0.0000
## SleepTime 0.0000 0.0000 0.0000
## Asthma 0.0000 0.0000 0.0000
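Almost all pairs of variables are significantly correlated with each other; the only exception is Stroke and Sex (p = 0.0805). With a sample this large, even very small correlations are significant. We need to check whether the variables are collinear.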
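To see why nearly every p-value in the matrix prints as 0.0000, the usual t test for a Pearson correlation can be computed by hand. The sketch below uses n = 319795 from the output above and r = 0.01, the rounded value reported for HeartDisease and SleepTime; since the printed correlations are rounded, the result is illustrative only.

```r
n <- 319795   # sample size reported by rcorr()
r <- 0.01     # a very small correlation (rounded value from the matrix)

# t statistic for H0: correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_value)
```

The t statistic is well above 5, so a correlation of this size is statistically significant even though it is negligible in practical terms.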
heart_lm<- lm(HeartDisease ~BMI+Smoking+Stroke+PhysicalHealth+MentalHealth+DiffWalking+Sex, data = heart_new)
summary(heart_lm) #linear model
##
## Call:
## lm(formula = HeartDisease ~ BMI + Smoking + Stroke + PhysicalHealth +
## MentalHealth + DiffWalking + Sex, data = heart_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56016 -0.09683 -0.05812 -0.01859 1.01269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.518e-02 2.285e-03 19.773 < 2e-16 ***
## BMI 5.458e-04 7.583e-05 7.198 6.12e-13 ***
## Smoking 3.800e-02 9.771e-04 38.888 < 2e-16 ***
## Stroke 2.320e-01 2.529e-03 91.747 < 2e-16 ***
## PhysicalHealth 3.291e-03 6.826e-05 48.209 < 2e-16 ***
## MentalHealth -8.728e-04 6.253e-05 -13.960 < 2e-16 ***
## DiffWalking 1.071e-01 1.552e-03 68.980 < 2e-16 ***
## Sex -4.200e-02 9.591e-04 -43.785 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2675 on 319787 degrees of freedom
## Multiple R-squared: 0.08609, Adjusted R-squared: 0.08607
## F-statistic: 4304 on 7 and 319787 DF, p-value: < 2.2e-16
plot(heart_lm, which = 1) # Residuals against fitted values for heart_lm, with a smoothed line showing the relationship between the two.
plot(heart_lm, which = 2) # Normal Q-Q plot: quantiles of the standardised residuals against the theoretical normal quantiles.
plot(heart_lm, which = 3) # Scale-location plot: square root of the absolute standardised residuals against the fitted values.
library(car) # assumed source of vif(); the original does not show which package was loaded
vif(heart_lm, digits = 3) # to check collinearity
## BMI Smoking Stroke PhysicalHealth MentalHealth
## 1.038640 1.034400 1.038038 1.316796 1.106100
## DiffWalking Sex
## 1.287861 1.025652
Every variable that I put in the model is statistically significant at the 5% significance level.
Interestingly, although the variables are all significant, R-squared is only 0.0861, which indicates that the regression model accounts for just 8.61% of the variability in the outcome.
Another notable point is that every coefficient is positive except those of MentalHealth and Sex. MentalHealth counts the days in the past 30 on which mental health was not good, so its negative coefficient means that, holding the other variables constant, more days of poor mental health go with a slightly lower predicted chance of heart disease, which is counterintuitive. For Sex, a female (1) has a lower predicted chance of heart disease than a male (0). The signs for BMI, Smoking, Stroke and DiffWalking are intuitive: larger values of these variables go with a higher predicted chance of heart disease. The positive coefficient of PhysicalHealth may look strange at first, but the variable counts the days on which physical health was not good, so larger values mean worse physical health and the positive sign is consistent.
To take two examples of interpreting the coefficients: the coefficient of 5.458e-04 for BMI means that a one-unit increase in BMI raises the predicted value of HeartDisease by 5.458e-04, holding the other variables constant. The coefficient of 3.800e-02 for Smoking means that a one-unit increase in Smoking raises the predicted value of HeartDisease by 3.800e-02, holding the other variables constant. Because Smoking is coded 0/1, this is the difference between the smoking (1) and non-smoking (0) groups.
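The interpretation of a coefficient can be checked by plugging values into the fitted equation. The sketch below copies the estimates from the summary output above; the helper `predict_heart()` and the example respondents are hypothetical and only illustrate the arithmetic.

```r
# Estimates copied from summary(heart_lm).
coefs <- c(Intercept = 4.518e-02, BMI = 5.458e-04, Smoking = 3.800e-02,
           Stroke = 2.320e-01, PhysicalHealth = 3.291e-03,
           MentalHealth = -8.728e-04, DiffWalking = 1.071e-01,
           Sex = -4.200e-02)

# Hypothetical helper: fitted value of HeartDisease for one respondent.
predict_heart <- function(bmi, smoking, stroke, physical, mental,
                          diffwalking, sex) {
  unname(coefs["Intercept"] + coefs["BMI"] * bmi +
         coefs["Smoking"] * smoking + coefs["Stroke"] * stroke +
         coefs["PhysicalHealth"] * physical +
         coefs["MentalHealth"] * mental +
         coefs["DiffWalking"] * diffwalking + coefs["Sex"] * sex)
}

# Two made-up respondents who differ only in Smoking.
non_smoker <- predict_heart(28, 0, 0, 0, 0, 0, 0)
smoker     <- predict_heart(28, 1, 0, 0, 0, 0, 0)
print(smoker - non_smoker)  # difference equals the Smoking coefficient

# A made-up respondent for whom the linear model predicts a value below 0.
low <- predict_heart(15, 0, 0, 0, 30, 0, 1)
print(low)
```

The last value is negative, which is possible because a linear model does not restrict its predictions to the range from 0 to 1. This agrees with the maximum residual of 1.01269 in the summary, which can only occur when a fitted value is below 0.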
The VIF values are all far below 10 (the largest is about 1.32), so we can conclude that the model does not have a collinearity problem.
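For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The sketch below computes it by hand with base R on the built-in `mtcars` data, not on the heart data; the choice of predictors is arbitrary and only illustrates the formula.

```r
# Built-in example data; wt, hp and disp play the role of predictors.
fit_data   <- mtcars[, c("mpg", "wt", "hp", "disp")]
predictors <- c("wt", "hp", "disp")

vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression of predictor p on the remaining predictors.
  aux <- lm(reformulate(others, response = p), data = fit_data)
  1 / (1 - summary(aux)$r.squared)
})

print(vif_by_hand)
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows as the auxiliary R-squared approaches 1.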