There are six stages for this project:
Ask
This is a fictional scenario for my Google Analytics capstone project.
The CDC has notified you, a senior epidemiologist, about a potential
epidemic. In the last few weeks, heart attack incidence has been
rapidly increasing. The CDC has an intervention plan. Your job is to
analyze the collected data and identify the statistically significant
risk factors so that public health resources can be allocated to heart
attack risk campaigns.
Prepare
If you click the previous link, you will notice that the data
contains a training sheet and a test sheet that serve different
functions for the competition. Since this is an analytics project, I
used the training sheet only, for two reasons: it has the dependent
variable, and it is much larger.
Process
I used four different tools: R, SPSS, Tableau, and Excel for cleaning,
analyzing, visualizing, and sharing the data. I used Excel to split the
blood pressure column into systolic and diastolic blood pressure, to
group the Body Mass Index (BMI) into five categories (including
underweight, normal, overweight, and obese), and to multiply the number
of hours spent sleeping and being sedentary each day by seven to get
the number of hours per week.
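The Excel cleaning steps described above could also be sketched in R.
This is only an illustration under stated assumptions: the toy data
frame, its column names, the "systolic/diastolic" text format, and the
four WHO-style BMI cut-offs are all assumed here and are not taken from
the original workbook.

```r
# Toy stand-in for the raw sheet; column names and values are assumptions.
raw <- data.frame(
  blood_pressure = c("158/88", "165/93", "91/60"),
  bmi = c(31.25, 27.19, 18.20),
  sleep_hours_per_day = c(6, 7, 4),
  sedentary_hours_per_day = c(6.6, 4.9, 9.5),
  stringsAsFactors = FALSE
)

# Split "systolic/diastolic" text into two numeric columns.
bp_parts <- strsplit(raw$blood_pressure, "/", fixed = TRUE)
raw$systolic_bp <- as.numeric(sapply(bp_parts, `[`, 1))
raw$diastolic_bp <- as.numeric(sapply(bp_parts, `[`, 2))

# Bin BMI with commonly used cut-offs (assumed, not the author's own bins).
# right = FALSE makes each interval closed on the left: [18.5, 25), etc.
raw$bmi_category <- cut(
  raw$bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal", "overweight", "obese"),
  right = FALSE
)

# Convert daily hours to weekly hours.
raw$sleep_hours_per_week <- raw$sleep_hours_per_day * 7
raw$sedentary_hours_per_week <- raw$sedentary_hours_per_day * 7
```

Doing these steps in a script rather than by hand makes the cleaning
repeatable if the source sheet changes.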
Analyze
The data was analyzed through three different analyses: Descriptive,
Correlation, and Prediction.
Descriptive Analyses:
Running this code displays summary measures, such as the central
tendency and variability, of the cleaned data.
summary(train_Heart_Attack_Risk_Analysis_cleaned)
##       Age              Sex           Cholesterol    Systolic BP
##  Min.   :18.00   Length:7010        Min.   :120.0   Min.   : 90
##  1st Qu.:35.00   Class :character   1st Qu.:192.0   1st Qu.:112
##  Median :53.00   Mode  :character   Median :259.0   Median :135
##  Mean   :53.51                      Mean   :259.9   Mean   :135
##  3rd Qu.:72.00                      3rd Qu.:329.0   3rd Qu.:158
##  Max.   :90.00                      Max.   :400.0   Max.   :180
##   Diastolic BP      Heart Rate        Diabetes      Family History
##  Min.   : 60.00   Min.   : 40.00   Min.   :0.0000   Min.   :0.0000
##  1st Qu.: 72.00   1st Qu.: 57.00   1st Qu.:0.0000   1st Qu.:0.0000
##  Median : 85.00   Median : 75.00   Median :1.0000   Median :0.0000
##  Mean   : 85.15   Mean   : 75.11   Mean   :0.6528   Mean   :0.4919
##  3rd Qu.: 98.00   3rd Qu.: 93.00   3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :110.00   Max.   :110.00   Max.   :1.0000   Max.   :1.0000
##     Smoking          Obesity       Alcohol Consumption   Exercise Hours Per Week
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000        Min.   : 0.002442
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000        1st Qu.: 5.046024
##  Median :1.0000   Median :0.0000   Median :1.0000        Median : 9.982968
##  Mean   :0.8963   Mean   :0.4999   Mean   :0.5959        Mean   : 9.979109
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000        3rd Qu.:15.029659
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000        Max.   :19.998709
##        Diet         Previous Heart Problems   Medication Use    Stress Level
##  Length:7010        Min.   :0.0000            Min.   :0.0000   Min.   : 1.000
##  Class :character   1st Qu.:0.0000            1st Qu.:0.0000   1st Qu.: 3.000
##  Mode  :character   Median :0.0000            Median :1.0000   Median : 5.000
##                     Mean   :0.4981            Mean   :0.5001   Mean   : 5.452
##                     3rd Qu.:1.0000            3rd Qu.:1.0000   3rd Qu.: 8.000
##                     Max.   :1.0000            Max.   :1.0000   Max.   :10.000
##  Sedentary Hours Per Day       Income            BMI        BMI categories
##  Min.   : 0.00884          Min.   : 20062   Min.   :18.00   Length:7010
##  1st Qu.:20.80282          1st Qu.: 88368   1st Qu.:23.42   Class :character
##  Median :41.55843          Median :157379   Median :28.74   Mode  :character
##  Mean   :41.95805          Mean   :158245   Mean   :28.88
##  3rd Qu.:63.12314          3rd Qu.:227219   3rd Qu.:34.32
##  Max.   :83.99519          Max.   :299954   Max.   :39.99
##  Triglycerides   Physical Activity Days Per Week   Sleep Hours Per Week
##  Min.   : 30.0   Min.   :0.000                     Min.   :28.00
##  1st Qu.:221.0   1st Qu.:2.000                     1st Qu.:35.00
##  Median :416.0   Median :3.000                     Median :49.00
##  Mean   :416.8   Mean   :3.492                     Mean   :49.17
##  3rd Qu.:613.0   3rd Qu.:5.000                     3rd Qu.:63.00
##  Max.   :800.0   Max.   :7.000                     Max.   :70.00
##      Country           Continent          Hemisphere      Heart Attack Risk
##  Length:7010        Length:7010        Length:7010        Min.   :0.0000
##  Class :character   Class :character   Class :character   1st Qu.:0.0000
##  Mode  :character   Mode  :character   Mode  :character   Median :0.0000
##                                                           Mean   :0.3572
##                                                           3rd Qu.:1.0000
##                                                           Max.   :1.0000
Figure.1 Population Pyramid
Figure.1 illustrates the population pyramid. The largest age group is
75 and above, which suggests that this is an aging population.
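The counts behind a population pyramid can be sketched in R as below.
This is a rough illustration only: the eight example people, the column
names, and the age bands are assumptions, and the original figure may
have been built with different bands and a different tool.

```r
# Hypothetical ages and sexes; the real data has 7010 rows.
people <- data.frame(
  age = c(18, 34, 52, 76, 81, 90, 45, 77),
  sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Female", "Male")
)

# Assumed age bands, closed on the left: [18, 30), [30, 45), ...
people$age_group <- cut(
  people$age,
  breaks = c(18, 30, 45, 60, 75, Inf),
  labels = c("18-29", "30-44", "45-59", "60-74", "75+"),
  right = FALSE
)

# Counts per age group and sex: the numbers a population pyramid plots.
pyramid_counts <- table(people$age_group, people$sex)

# Age group with the most people across both sexes.
largest_group <- names(which.max(rowSums(pyramid_counts)))
```

With this toy data, `largest_group` is the 75+ band because four of the
eight example ages fall in it.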
Heart Attack Risk Map
The map displays the distribution of heart attack risk across
different countries interactively, with filtering legends for countries,
average cholesterol, average triglycerides, and diabetes. Feel free to
play with it!
Correlation Analyses:
Table.1: Cross tabulation between the Heart Attack Risk
and Diabetes
Table.1 shows that the largest cell of the two-by-two table is those
who do not have heart attack risk and have diabetes (n=2981).
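A two-by-two cross tabulation like this one can be sketched in R as
follows. The two 0/1 vectors are made-up values for illustration, not
the project data, and the variable names are assumptions.

```r
# Hypothetical 0/1 indicators, not the project data.
heart_attack_risk <- c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
diabetes          <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0)

# Two-by-two cross tabulation with labelled dimensions.
cross_tab <- table(risk = heart_attack_risk, diabetes = diabetes)

# Share of all observations that falls in each cell.
cell_share <- prop.table(cross_tab)

# Long format (one row per cell) makes it easy to pick the largest cell.
cells <- as.data.frame(cross_tab)
largest_cell <- cells[which.max(cells$Freq), ]
```

In this toy example the largest cell is risk = 0 with diabetes = 1,
which is the same kind of cell that Table.1 highlights.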
Table.2: Association between Heart Attack Risk and
Diabetes
Table.2 shows the Chi-square test of the association between heart
attack risk and diabetes. There is a significant association between
the two variables (χ² = 6.972, p < 0.05).
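For readers who want to run a comparable test in R, here is a minimal
sketch using `chisq.test`. The counts are hypothetical and chosen only
so the arithmetic is easy to follow; they are not the counts behind
Table.2 and will not reproduce its statistic.

```r
# Hypothetical counts for illustration only; not the counts behind Table.2.
counts <- matrix(
  c(30, 20,
    20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(risk = c("0", "1"), diabetes = c("0", "1"))
)

# Pearson chi-square test; correct = FALSE turns off Yates' continuity
# correction, which chisq.test applies to 2x2 tables by default.
test <- chisq.test(counts, correct = FALSE)

statistic <- unname(test$statistic)   # chi-square value
significant <- test$p.value < 0.05    # compare against the 0.05 level
```

Here every expected count is 25, so the statistic is 4 * (5^2 / 25) = 4
on one degree of freedom, which falls just below the 0.05 threshold.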
Prediction Analyses:
Binary Logistic Regression
A binary logistic regression was performed to predict Heart Attack
Risk as the dependent variable (DV), using the following independent
variables: Age, Family History, Income, Cholesterol,
Triglycerides, Heart Rate, Systolic BP, Diastolic BP, BMI, Diabetes,
Obesity, Previous Heart Problems, Medication Use, Exercise Hours Per
Week, Sedentary Hours Per Week, Physical Activity Days Per Week,
Smoking, Alcohol Consumption, Stress Level.
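A model of this kind can be sketched in R with `glm`. The example
below is only an illustration: it uses simulated data, just three of
the predictors, and assumed column names, so its coefficients say
nothing about the real dataset.

```r
set.seed(1)  # reproducible toy data

# Simulated stand-in with three of the predictors; all values are made up.
n <- 500
toy <- data.frame(
  Age = sample(18:90, n, replace = TRUE),
  Diabetes = rbinom(n, 1, 0.65),
  Cholesterol = sample(120:400, n, replace = TRUE)
)

# Outcome generated with a small positive diabetes effect (an assumption).
toy$HeartAttackRisk <- rbinom(n, 1, plogis(-0.8 + 0.3 * toy$Diabetes))

# Binary logistic regression: binomial family with the default logit link.
model <- glm(HeartAttackRisk ~ Age + Diabetes + Cholesterol,
             data = toy, family = binomial)

coef_table <- summary(model)$coefficients    # estimate (B), SE, z, p-value
minus_2ll <- -2 * as.numeric(logLik(model))  # the "-2 Log Likelihood"
```

For a 0/1 outcome, `minus_2ll` equals the residual deviance that
`summary(model)` prints, so either number can be compared across
nested models.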
Table.3: Binary Logistic Regression
Table.3 shows the binary logistic regression findings. The -2 Log
Likelihood value of 9115.345 indicates a reasonable model fit. The
constant has a B coefficient of -1.067 and a p-value less than 0.001,
so it is statistically significant in the model. For the variable
“Diabetes”, the p-value is 0.006, the B coefficient is 0.145, and
Exp(B) is 1.156. Because Diabetes is coded 0/1, a one-unit increase
means having diabetes versus not having it: the odds of the outcome
occurring increase by a factor of 1.156, holding all other variables
constant.
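The link between the B coefficient and Exp(B) is just exponentiation,
which can be checked with a few lines of R. The variable names are
chosen here for illustration; the only input is the reported B value
for Diabetes.

```r
b_diabetes <- 0.145              # B coefficient reported for Diabetes
odds_ratio <- exp(b_diabetes)    # Exp(B): odds ratio for diabetes = 1 vs 0
odds_ratio_rounded <- round(odds_ratio, 3)

# Percentage increase in the odds for people with diabetes.
pct_increase <- (odds_ratio - 1) * 100
```

Rounded to three decimals the odds ratio is 1.156, matching the Exp(B)
column, and it corresponds to roughly 15.6% higher odds.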
Sharing
Act
Based on the three different analyses, the population is aging, and
diabetes is the highest risk compared to other comorbidities. Heart
attack risk covers six continents. Both the chi-square test and the
binary logistic regression suggest that diabetes is a statistically
significant variable for heart attack risk. Therefore, resources and
heart attack risk campaigns should be allocated to elderly diabetic
patients around the globe.