Jack worked on Heart Disease UCI Data Preprocessing, Introduction, UCI verses Framingham, and Cholesterol and Heart Rate Regression Graph
Omar contributed to preprocessing of the Framingham data, the code for training and testing sets used to run the linear model for Ten Year CHD predictions, and helped in the completion of the conclusion.
Amanda Wade worked on the introduction to the stroke data as well as histograms for various variables.
Aden Eagle preprocessed and oversampled the data with SMOTE and implemented K-nearest neighbors and neural network classifiers.
Jillian Hanley worked on the correlation matrix of the stroke data and the binary logistic regression model.
Our project focuses on stroke and heart disease and how different factors may play a role in influencing the cause of these diseases. Indicators such as blood pressure, smoking, cholesterol levels, etc. are notoriously known for being contributing factors to these diseases. But is that necessarily true or is there a more complex answer to this pressing topic? We analyzed the correlation of these variables with the occurrence of heart disease/stroke and employed various statistical tools to ascertain their significance. We also utilized machine learning algorithms to attempt to determine if a given person had a stroke or suffered from heart disease using this health information.
Two data sets were initially chosen to represent heart disease, but after looking at the preprocessing of both data sets the Framingham data seemed to be the better choice. The Framingham data had significantly more data to draw results from and included columns like heart rate and smoking which the UCI data lacked.
We first opened and viewed the data.
heart <- read.csv("framingham.csv")
head(heart[c(1:11)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 39 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 195 | 106.0 |
| 0 | 46 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 121.0 |
| 1 | 48 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 |
| 0 | 61 | 3 | 1 | 30 | 0 | 0 | 1 | 0 | 225 | 150.0 |
| 0 | 46 | 3 | 1 | 23 | 0 | 0 | 0 | 0 | 285 | 130.0 |
| 0 | 43 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 228 | 180.0 |
head(heart[c(12:16)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|
| 70 | 26.97 | 80 | 77 | 0 |
| 81 | 28.73 | 95 | 76 | 0 |
| 80 | 25.34 | 75 | 70 | 0 |
| 95 | 28.58 | 65 | 103 | 1 |
| 84 | 23.10 | 85 | 85 | 0 |
| 110 | 30.30 | 77 | 99 | 0 |
#Removing cases with N/A values
no_na_heart <- drop_na(heart)
We then performed a simple regression analysis between cholesterol and heart rate.
ggplot(heart) + geom_point(mapping = aes(x = totChol, y = heartRate, shape = as.factor(TenYearCHD))) + stat_smooth(method = 'lm', aes(x = totChol, y = heartRate), se = FALSE)
This regression illustrates the relationship between cholesterol and heart rate. As seen in the graph, there is a positive correlation, but more interestingly there is the cluster of points which include both those with and without heart disease. This suggests that there’s not one definitive cause to heart disease.
Therefore, we decided to examine the capacity of multiple variables to predict whether a person would develop Coronary Heart Disease in ten years from data collection (TenYearCHD).
#Encoding categorical variables
dummies <- dummyVars(" ~ .", data=no_na_heart)
encoded <- data.frame(predict(dummies, newdata=no_na_heart))
#Splitting into only train and test sets
encoded_n <- nrow(encoded)
valTestIndex <- sample(1:encoded_n, size = round(0.4*encoded_n), replace = FALSE)
train <- encoded[-valTestIndex ,]
test <- encoded[valTestIndex ,]
#Model with training dataset, predicting 10-year CHD
model <- lm(TenYearCHD ~ ., data=train)
summary(model)
##
## Call:
## lm(formula = TenYearCHD ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67096 -0.19050 -0.11123 -0.02077 1.05504
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5807970 0.1024672 -5.668 1.64e-08 ***
## male 0.0577204 0.0163672 3.527 0.00043 ***
## age 0.0053622 0.0010031 5.346 9.96e-08 ***
## education -0.0070035 0.0074558 -0.939 0.34766
## currentSmoker 0.0280603 0.0236865 1.185 0.23628
## cigsPerDay 0.0011164 0.0010280 1.086 0.27757
## BPMeds 0.0606048 0.0464378 1.305 0.19201
## prevalentStroke 0.1008361 0.0938657 1.074 0.28283
## prevalentHyp 0.0212887 0.0227128 0.937 0.34871
## diabetes 0.0004577 0.0563292 0.008 0.99352
## totChol 0.0003249 0.0001765 1.841 0.06577 .
## sysBP 0.0029838 0.0006588 4.529 6.24e-06 ***
## diaBP -0.0014686 0.0010855 -1.353 0.17622
## BMI 0.0015924 0.0020070 0.793 0.42763
## heartRate -0.0007357 0.0006330 -1.162 0.24526
## glucose 0.0010984 0.0003887 2.826 0.00476 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3473 on 2178 degrees of freedom
## Multiple R-squared: 0.09437, Adjusted R-squared: 0.08813
## F-statistic: 15.13 on 15 and 2178 DF, p-value: < 2.2e-16
#Model with test dataset
model <- lm(TenYearCHD ~ ., data=test)
summary(model)
##
## Call:
## lm(formula = TenYearCHD ~ ., data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63361 -0.18558 -0.09101 0.00582 1.05524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5661122 0.1243122 -4.554 5.71e-06 ***
## male 0.0569643 0.0193454 2.945 0.00329 **
## age 0.0097106 0.0012042 8.064 1.54e-15 ***
## education -0.0029227 0.0087468 -0.334 0.73832
## currentSmoker -0.0222564 0.0283880 -0.784 0.43316
## cigsPerDay 0.0038083 0.0012083 3.152 0.00166 **
## BPMeds 0.0248305 0.0528585 0.470 0.63860
## prevalentStroke 0.2119191 0.1278638 1.657 0.09766 .
## prevalentHyp 0.0345475 0.0274034 1.261 0.20762
## diabetes 0.0735565 0.0717340 1.025 0.30534
## totChol -0.0001723 0.0002116 -0.814 0.41566
## sysBP 0.0016867 0.0007557 2.232 0.02577 *
## diaBP -0.0005829 0.0012465 -0.468 0.64010
## BMI -0.0025116 0.0024326 -1.032 0.30201
## heartRate 0.0001730 0.0007901 0.219 0.82675
## glucose 0.0011334 0.0004843 2.340 0.01940 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3323 on 1446 degrees of freedom
## Multiple R-squared: 0.1226, Adjusted R-squared: 0.1135
## F-statistic: 13.47 on 15 and 1446 DF, p-value: < 2.2e-16
A linear model employing all the variables in the data set yielded weak predictive capability, with a multiple R^2 = 0.1028. Again, this further supports there is great variability in prediction and such a question might necessitate more complex analyses.
#remove incomplete cases
data <- heart %>% na.omit() %>%
select(education, age, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, TenYearCHD, BMI, glucose, sysBP, diaBP)
#View(data)
KNN works by finding the distances between points in a data set. A k value is set, which determines a set amount of points to analyze surrounding a specific point. The specific point is classified based on the majority of the type of each point within k. K is usually set as the square root of total observations, but due to the nature of this dataset, this was not optimal. Instead, K of 5 was used since it produced the most reliable results. There were 1380 true negative cases and 5 false negative cases. There were also 232 true positives cases and 18 false posiitve cases. As shown by the graphs, there was no clear factor that contributed to an increase risk in Heart Disease.
heart = no_na_heart
num.vars <- sapply(heart, is.numeric)
set.seed(123)
test <- 2001:3635
train.heart <- heart[-test,]
test.heart <- heart[test,]
train.def <- heart$TenYearCHD[-test]
test.def <- heart$TenYearCHD[test]
preprocessor <- preProcess(train.heart, method=c("center", "scale"))
normalized_train <- predict(preprocessor, train.heart)
normalized_test <- predict(preprocessor, test.heart)
train.heart <- normalized_train
test.heart <- normalized_test
knn.1 <- knn(train.heart, test.heart, train.def, k =10)
table(knn.1 ,test.def)
## test.def
## knn.1 0 1
## 0 1380 18
## 1 5 232
plot.df = data.frame(test.heart, predicted = knn.1)
names(plot.df)
## [1] "male" "age" "education" "currentSmoker"
## [5] "cigsPerDay" "BPMeds" "prevalentStroke" "prevalentHyp"
## [9] "diabetes" "totChol" "sysBP" "diaBP"
## [13] "BMI" "heartRate" "glucose" "TenYearCHD"
## [17] "predicted"
plot.df1 = data.frame(x = plot.df$age,
y = plot.df$heartRate,
predicted = plot.df$predicted)
ggplot(plot.df, aes(age, heartRate, color = predicted, fill = predicted, alpha = predicted, shape = as.factor(prevalentHyp))) +
geom_point(size = 3)
ggplot(plot.df, aes(totChol, glucose, color = predicted, fill = predicted, alpha = predicted, shape = as.factor(diabetes))) +
geom_point(size = 3)
ggplot(plot.df, aes(cigsPerDay, heartRate, color = predicted, fill = predicted, alpha = predicted)) + geom_point(size = 3)
I hoped to use ordination to explore patterns in the data. I initially used principal componant analysis but found that the differences in scales of the variables made this data difficult to analyze in this way. Some variabals had very large number while others were binary. Non-metric multidimensional scaling is much better at handling data like this, especially since any ordination distance can be used. I tried bothe euclidean an bray-curtis distances and achieved similar results. However, I think the Bray-Curtis dissimilarity handled binary variables better. I excluded data about sex because it did not correlate well with any of the variables and allowed for better clustering.
The dataset was large and made running NMDS on the whole thing difficult. I opted to take 10 random samples of 300 and run the same NMDS analysis on all of them.I grouped according to prevalence of hypertension and found that it generally accounted for most of the individuals.
#Principle component analysis
myrda <- rda(data)
biplot(myrda,
display = c("species", "sites"),
type = c("text", "points"))
#nonmetric multidimensional scaling
#This takes forever to run, but with enough time is probably possible
#heartMDS <- metaMDS(data, distance = "bray", k = 3, trymax = 100)
#stressplot(heartMDS)
#ordiplot(heartMDS)
#plot(envfit(MDS, mysample, permutations = 999))
#solution
for (x in 1:10){
mysample <- data[sample(1:nrow(data), 300, replace=TRUE),]
MDS <- metaMDS(mysample, distance = "bray", k = 3, trymax = 100)
stressplot(MDS)
ordiplot(MDS)
plot(envfit(MDS, mysample, permutations = 999))
ordiellipse(MDS, mysample$prevalentHyp, display = "sites", col = "orange")
}
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.09007612
## Run 1 stress 0.1073506
## Run 2 stress 0.08674598
## ... New best solution
## ... Procrustes: rmse 0.02062612 max resid 0.1424498
## Run 3 stress 0.0911402
## Run 4 stress 0.08991935
## Run 5 stress 0.08759387
## Run 6 stress 0.1111169
## Run 7 stress 0.09007544
## Run 8 stress 0.1094545
## Run 9 stress 0.08961089
## Run 10 stress 0.09676924
## Run 11 stress 0.1210348
## Run 12 stress 0.09676082
## Run 13 stress 0.1127997
## Run 14 stress 0.102722
## Run 15 stress 0.1136992
## Run 16 stress 0.08825629
## Run 17 stress 0.1163332
## Run 18 stress 0.09074689
## Run 19 stress 0.112221
## Run 20 stress 0.1018035
## Run 21 stress 0.09025851
## Run 22 stress 0.09115414
## Run 23 stress 0.08992895
## Run 24 stress 0.09687533
## Run 25 stress 0.09784998
## Run 26 stress 0.09016546
## Run 27 stress 0.08916215
## Run 28 stress 0.09013213
## Run 29 stress 0.1161327
## Run 30 stress 0.08764699
## Run 31 stress 0.09007425
## Run 32 stress 0.09088187
## Run 33 stress 0.08968621
## Run 34 stress 0.112364
## Run 35 stress 0.1113917
## Run 36 stress 0.1084918
## Run 37 stress 0.1195211
## Run 38 stress 0.09700459
## Run 39 stress 0.09683309
## Run 40 stress 0.08742865
## Run 41 stress 0.1089835
## Run 42 stress 0.1175307
## Run 43 stress 0.09108971
## Run 44 stress 0.09118405
## Run 45 stress 0.09065262
## Run 46 stress 0.0890751
## Run 47 stress 0.09014924
## Run 48 stress 0.0913244
## Run 49 stress 0.08985897
## Run 50 stress 0.1200683
## Run 51 stress 0.09037586
## Run 52 stress 0.09477593
## Run 53 stress 0.09115659
## Run 54 stress 0.1221
## Run 55 stress 0.1132699
## Run 56 stress 0.08845211
## Run 57 stress 0.1096282
## Run 58 stress 0.1127558
## Run 59 stress 0.08944135
## Run 60 stress 0.113441
## Run 61 stress 0.09618247
## Run 62 stress 0.08770057
## Run 63 stress 0.09072484
## Run 64 stress 0.1182711
## Run 65 stress 0.08943793
## Run 66 stress 0.08882119
## Run 67 stress 0.1005616
## Run 68 stress 0.1165763
## Run 69 stress 0.08739737
## Run 70 stress 0.08942309
## Run 71 stress 0.09821938
## Run 72 stress 0.09031389
## Run 73 stress 0.08682078
## ... Procrustes: rmse 0.002618574 max resid 0.02539316
## Run 74 stress 0.09772246
## Run 75 stress 0.112889
## Run 76 stress 0.09628904
## Run 77 stress 0.08825686
## Run 78 stress 0.09074472
## Run 79 stress 0.09746435
## Run 80 stress 0.08771056
## Run 81 stress 0.09164397
## Run 82 stress 0.08928741
## Run 83 stress 0.08968452
## Run 84 stress 0.09009723
## Run 85 stress 0.09548798
## Run 86 stress 0.0898584
## Run 87 stress 0.09613614
## Run 88 stress 0.08881778
## Run 89 stress 0.08739631
## Run 90 stress 0.1155659
## Run 91 stress 0.09171939
## Run 92 stress 0.09651834
## Run 93 stress 0.09863562
## Run 94 stress 0.1066123
## Run 95 stress 0.09074072
## Run 96 stress 0.1118559
## Run 97 stress 0.09016561
## Run 98 stress 0.1076461
## Run 99 stress 0.09086455
## Run 100 stress 0.08765776
## *** No convergence -- monoMDS stopping criteria:
## 17: no. of iterations >= maxit
## 69: stress ratio > sratmax
## 14: scale factor of the gradient < sfgrmin
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07988605
## Run 1 stress 0.08045534
## Run 2 stress 0.07986423
## ... New best solution
## ... Procrustes: rmse 0.000889752 max resid 0.01494961
## Run 3 stress 0.07988797
## ... Procrustes: rmse 0.007114556 max resid 0.121257
## Run 4 stress 0.07988438
## ... Procrustes: rmse 0.0009411556 max resid 0.01488386
## Run 5 stress 0.08113374
## Run 6 stress 0.08062264
## Run 7 stress 0.08064795
## Run 8 stress 0.08045551
## Run 9 stress 0.08194212
## Run 10 stress 0.08114357
## Run 11 stress 0.07988437
## ... Procrustes: rmse 0.0009457758 max resid 0.01490557
## Run 12 stress 0.07988422
## ... Procrustes: rmse 0.0009419213 max resid 0.0148885
## Run 13 stress 0.08045545
## Run 14 stress 0.0798864
## ... Procrustes: rmse 0.007123702 max resid 0.1213038
## Run 15 stress 0.0982179
## Run 16 stress 0.07996887
## ... Procrustes: rmse 0.001876861 max resid 0.01507729
## Run 17 stress 0.08728315
## Run 18 stress 0.08024228
## ... Procrustes: rmse 0.01457339 max resid 0.04016334
## Run 19 stress 0.08045557
## Run 20 stress 0.07988799
## ... Procrustes: rmse 0.007115218 max resid 0.121263
## Run 21 stress 0.0798628
## ... New best solution
## ... Procrustes: rmse 0.007069691 max resid 0.1212877
## Run 22 stress 0.08045547
## Run 23 stress 0.08072861
## Run 24 stress 0.0806581
## Run 25 stress 0.08024989
## ... Procrustes: rmse 0.01619142 max resid 0.1222392
## Run 26 stress 0.09989734
## Run 27 stress 0.08045547
## Run 28 stress 0.08024224
## ... Procrustes: rmse 0.01618287 max resid 0.1222275
## Run 29 stress 0.08062659
## Run 30 stress 0.08071295
## Run 31 stress 0.07986264
## ... New best solution
## ... Procrustes: rmse 6.415693e-05 max resid 0.0006803248
## ... Similar to previous best
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07555323
## Run 1 stress 0.07839043
## Run 2 stress 0.07555284
## ... New best solution
## ... Procrustes: rmse 0.000140649 max resid 0.00191261
## ... Similar to previous best
## Run 3 stress 0.07912022
## Run 4 stress 0.07559483
## ... Procrustes: rmse 0.001328983 max resid 0.01975423
## Run 5 stress 0.07912023
## Run 6 stress 0.0790603
## Run 7 stress 0.07844726
## Run 8 stress 0.07839059
## Run 9 stress 0.0755949
## ... Procrustes: rmse 0.001329374 max resid 0.01971916
## Run 10 stress 0.07555292
## ... Procrustes: rmse 3.01317e-05 max resid 0.0003801015
## ... Similar to previous best
## Run 11 stress 0.07839021
## Run 12 stress 0.08168856
## Run 13 stress 0.07555289
## ... Procrustes: rmse 2.873e-05 max resid 0.0004161298
## ... Similar to previous best
## Run 14 stress 0.07555311
## ... Procrustes: rmse 0.0001209194 max resid 0.001869606
## ... Similar to previous best
## Run 15 stress 0.07559483
## ... Procrustes: rmse 0.001322217 max resid 0.01989918
## Run 16 stress 0.07580512
## ... Procrustes: rmse 0.002025722 max resid 0.0340366
## Run 17 stress 0.0784211
## Run 18 stress 0.07839053
## Run 19 stress 0.07906021
## Run 20 stress 0.07966937
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07631092
## Run 1 stress 0.1023888
## Run 2 stress 0.07644609
## ... Procrustes: rmse 0.008136074 max resid 0.1318074
## Run 3 stress 0.0763005
## ... New best solution
## ... Procrustes: rmse 0.002844701 max resid 0.01619615
## Run 4 stress 0.08251427
## Run 5 stress 0.07682954
## Run 6 stress 0.1085956
## Run 7 stress 0.0930297
## Run 8 stress 0.08324626
## Run 9 stress 0.07769546
## Run 10 stress 0.07682951
## Run 11 stress 0.09494386
## Run 12 stress 0.09558671
## Run 13 stress 0.07633401
## ... Procrustes: rmse 0.002432128 max resid 0.0147198
## Run 14 stress 0.07682951
## Run 15 stress 0.08912079
## Run 16 stress 0.09284035
## Run 17 stress 0.08849264
## Run 18 stress 0.08252609
## Run 19 stress 0.08425479
## Run 20 stress 0.07769516
## Run 21 stress 0.09875642
## Run 22 stress 0.09804926
## Run 23 stress 0.077489
## Run 24 stress 0.07825414
## Run 25 stress 0.08252762
## Run 26 stress 0.07741071
## Run 27 stress 0.08909763
## Run 28 stress 0.09905859
## Run 29 stress 0.09692296
## Run 30 stress 0.08256366
## Run 31 stress 0.07643318
## ... Procrustes: rmse 0.008232382 max resid 0.1317669
## Run 32 stress 0.07648734
## ... Procrustes: rmse 0.008095048 max resid 0.1318021
## Run 33 stress 0.07734282
## Run 34 stress 0.07630926
## ... Procrustes: rmse 0.001333754 max resid 0.01340305
## Run 35 stress 0.0835959
## Run 36 stress 0.07633806
## ... Procrustes: rmse 0.002431014 max resid 0.01514084
## Run 37 stress 0.0777186
## Run 38 stress 0.08372271
## Run 39 stress 0.08358199
## Run 40 stress 0.09945157
## Run 41 stress 0.1013817
## Run 42 stress 0.0887159
## Run 43 stress 0.08816562
## Run 44 stress 0.09367683
## Run 45 stress 0.08831711
## Run 46 stress 0.08907632
## Run 47 stress 0.08968519
## Run 48 stress 0.1059581
## Run 49 stress 0.07682964
## Run 50 stress 0.07741062
## Run 51 stress 0.07643288
## ... Procrustes: rmse 0.008241621 max resid 0.1317565
## Run 52 stress 0.07817518
## Run 53 stress 0.07682964
## Run 54 stress 0.08463743
## Run 55 stress 0.0773429
## Run 56 stress 0.0945852
## Run 57 stress 0.07817481
## Run 58 stress 0.08274334
## Run 59 stress 0.08376885
## Run 60 stress 0.0827057
## Run 61 stress 0.09475385
## Run 62 stress 0.08251025
## Run 63 stress 0.07734274
## Run 64 stress 0.08304635
## Run 65 stress 0.07689318
## Run 66 stress 0.0839375
## Run 67 stress 0.07776996
## Run 68 stress 0.0889937
## Run 69 stress 0.07741296
## Run 70 stress 0.07638979
## ... Procrustes: rmse 0.00177121 max resid 0.01433156
## Run 71 stress 0.0934456
## Run 72 stress 0.07741073
## Run 73 stress 0.08781264
## Run 74 stress 0.100469
## Run 75 stress 0.1015082
## Run 76 stress 0.08321771
## Run 77 stress 0.08917265
## Run 78 stress 0.1029513
## Run 79 stress 0.07827608
## Run 80 stress 0.09328375
## Run 81 stress 0.08372603
## Run 82 stress 0.09383028
## Run 83 stress 0.09812777
## Run 84 stress 0.07632774
## ... Procrustes: rmse 0.0008455678 max resid 0.01428012
## Run 85 stress 0.08234673
## Run 86 stress 0.08384494
## Run 87 stress 0.103826
## Run 88 stress 0.07641862
## ... Procrustes: rmse 0.007825386 max resid 0.1318511
## Run 89 stress 0.09491498
## Run 90 stress 0.07734271
## Run 91 stress 0.09126206
## Run 92 stress 0.1016094
## Run 93 stress 0.07817461
## Run 94 stress 0.07682942
## Run 95 stress 0.08358892
## Run 96 stress 0.07631069
## ... Procrustes: rmse 0.00280624 max resid 0.0160503
## Run 97 stress 0.08400442
## Run 98 stress 0.08593592
## Run 99 stress 0.0842892
## Run 100 stress 0.08748081
## *** No convergence -- monoMDS stopping criteria:
## 21: no. of iterations >= maxit
## 59: stress ratio > sratmax
## 20: scale factor of the gradient < sfgrmin
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07793496
## Run 1 stress 0.08345573
## Run 2 stress 0.07950305
## Run 3 stress 0.07793498
## ... Procrustes: rmse 8.656896e-05 max resid 0.00118858
## ... Similar to previous best
## Run 4 stress 0.08338931
## Run 5 stress 0.08255836
## Run 6 stress 0.08246408
## Run 7 stress 0.08253613
## Run 8 stress 0.07825041
## ... Procrustes: rmse 0.007189571 max resid 0.1229027
## Run 9 stress 0.07793476
## ... New best solution
## ... Procrustes: rmse 7.0321e-05 max resid 0.000914181
## ... Similar to previous best
## Run 10 stress 0.0779349
## ... Procrustes: rmse 5.128571e-05 max resid 0.0006044514
## ... Similar to previous best
## Run 11 stress 0.07793495
## ... Procrustes: rmse 9.301436e-05 max resid 0.0008152133
## ... Similar to previous best
## Run 12 stress 0.08345556
## Run 13 stress 0.07793488
## ... Procrustes: rmse 4.74076e-05 max resid 0.0004324236
## ... Similar to previous best
## Run 14 stress 0.07825371
## ... Procrustes: rmse 0.007201844 max resid 0.1230166
## Run 15 stress 0.0794033
## Run 16 stress 0.07825052
## ... Procrustes: rmse 0.007188911 max resid 0.1228933
## Run 17 stress 0.07793497
## ... Procrustes: rmse 5.537042e-05 max resid 0.0005773593
## ... Similar to previous best
## Run 18 stress 0.08236959
## Run 19 stress 0.07964811
## Run 20 stress 0.07793501
## ... Procrustes: rmse 9.804391e-05 max resid 0.0009594424
## ... Similar to previous best
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07410432
## Run 1 stress 0.07711579
## Run 2 stress 0.07711901
## Run 3 stress 0.07361072
## ... New best solution
## ... Procrustes: rmse 0.008233704 max resid 0.1414918
## Run 4 stress 0.07409984
## ... Procrustes: rmse 0.00822779 max resid 0.1414896
## Run 5 stress 0.07719168
## Run 6 stress 0.07410053
## ... Procrustes: rmse 0.008225887 max resid 0.1415367
## Run 7 stress 0.07715765
## Run 8 stress 0.07409982
## ... Procrustes: rmse 0.008231171 max resid 0.1415466
## Run 9 stress 0.07360977
## ... New best solution
## ... Procrustes: rmse 0.0002849343 max resid 0.003999329
## ... Similar to previous best
## Run 10 stress 0.07360991
## ... Procrustes: rmse 0.0001051132 max resid 0.001238871
## ... Similar to previous best
## Run 11 stress 0.07410002
## ... Procrustes: rmse 0.008227229 max resid 0.1415564
## Run 12 stress 0.07719853
## Run 13 stress 0.07409995
## ... Procrustes: rmse 0.008242844 max resid 0.141839
## Run 14 stress 0.07361075
## ... Procrustes: rmse 0.0002670841 max resid 0.003997129
## ... Similar to previous best
## Run 15 stress 0.07409997
## ... Procrustes: rmse 0.008227051 max resid 0.1415342
## Run 16 stress 0.07362478
## ... Procrustes: rmse 0.0005759871 max resid 0.006121655
## ... Similar to previous best
## Run 17 stress 0.07711067
## Run 18 stress 0.07361431
## ... Procrustes: rmse 0.0004719 max resid 0.005248818
## ... Similar to previous best
## Run 19 stress 0.07361077
## ... Procrustes: rmse 0.0002639393 max resid 0.003997699
## ... Similar to previous best
## Run 20 stress 0.07711404
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07623957
## Run 1 stress 0.07732742
## Run 2 stress 0.07840935
## Run 3 stress 0.07624034
## ... Procrustes: rmse 0.0003768013 max resid 0.004371012
## ... Similar to previous best
## Run 4 stress 0.07780323
## Run 5 stress 0.07647468
## ... Procrustes: rmse 0.007730161 max resid 0.1257849
## Run 6 stress 0.07623998
## ... Procrustes: rmse 0.0001125424 max resid 0.001191911
## ... Similar to previous best
## Run 7 stress 0.07625234
## ... Procrustes: rmse 0.001765868 max resid 0.02439018
## Run 8 stress 0.0762394
## ... New best solution
## ... Procrustes: rmse 0.0001782712 max resid 0.0027822
## ... Similar to previous best
## Run 9 stress 0.07835936
## Run 10 stress 0.0779857
## Run 11 stress 0.08035475
## Run 12 stress 0.07835903
## Run 13 stress 0.07780233
## Run 14 stress 0.07907699
## Run 15 stress 0.07625185
## ... Procrustes: rmse 0.001880307 max resid 0.02677984
## Run 16 stress 0.07647406
## ... Procrustes: rmse 0.007391574 max resid 0.1255328
## Run 17 stress 0.07736039
## Run 18 stress 0.07647404
## ... Procrustes: rmse 0.007386053 max resid 0.1255103
## Run 19 stress 0.07860925
## Run 20 stress 0.07860893
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07486869
## Run 1 stress 0.0840421
## Run 2 stress 0.07509062
## ... Procrustes: rmse 0.007326503 max resid 0.124849
## Run 3 stress 0.07528103
## ... Procrustes: rmse 0.008212688 max resid 0.1250903
## Run 4 stress 0.07528102
## ... Procrustes: rmse 0.008211582 max resid 0.1250938
## Run 5 stress 0.07511004
## ... Procrustes: rmse 0.02191143 max resid 0.1355159
## Run 6 stress 0.07606211
## Run 7 stress 0.08404329
## Run 8 stress 0.08749215
## Run 9 stress 0.07528218
## ... Procrustes: rmse 0.0208214 max resid 0.1360732
## Run 10 stress 0.07548026
## Run 11 stress 0.07511024
## ... Procrustes: rmse 0.02191647 max resid 0.1355547
## Run 12 stress 0.075038
## ... Procrustes: rmse 0.01977501 max resid 0.1359459
## Run 13 stress 0.08001672
## Run 14 stress 0.07509046
## ... Procrustes: rmse 0.007330819 max resid 0.1248913
## Run 15 stress 0.09126092
## Run 16 stress 0.07662499
## Run 17 stress 0.08040065
## Run 18 stress 0.08797994
## Run 19 stress 0.08437686
## Run 20 stress 0.08540266
## Run 21 stress 0.07622111
## Run 22 stress 0.07585444
## Run 23 stress 0.07585457
## Run 24 stress 0.07520714
## ... Procrustes: rmse 0.003359117 max resid 0.05437184
## Run 25 stress 0.07512172
## ... Procrustes: rmse 0.0178145 max resid 0.04563828
## Run 26 stress 0.07512169
## ... Procrustes: rmse 0.01780696 max resid 0.04562928
## Run 27 stress 0.07990891
## Run 28 stress 0.07733065
## Run 29 stress 0.07994523
## Run 30 stress 0.07511296
## ... Procrustes: rmse 0.0205105 max resid 0.1360655
## Run 31 stress 0.08397326
## Run 32 stress 0.0769825
## Run 33 stress 0.07673398
## Run 34 stress 0.08001552
## Run 35 stress 0.08310472
## Run 36 stress 0.08404609
## Run 37 stress 0.07955759
## Run 38 stress 0.07990881
## Run 39 stress 0.0748614
## ... New best solution
## ... Procrustes: rmse 0.0005570582 max resid 0.00874272
## ... Similar to previous best
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07612839
## Run 1 stress 0.07600935
## ... New best solution
## ... Procrustes: rmse 0.01903209 max resid 0.06201078
## Run 2 stress 0.07932812
## Run 3 stress 0.07600948
## ... Procrustes: rmse 9.157001e-05 max resid 0.001167921
## ... Similar to previous best
## Run 4 stress 0.07601609
## ... Procrustes: rmse 0.0005617365 max resid 0.008495579
## ... Similar to previous best
## Run 5 stress 0.076338
## ... Procrustes: rmse 0.02004997 max resid 0.07463981
## Run 6 stress 0.07600931
## ... New best solution
## ... Procrustes: rmse 5.687208e-05 max resid 0.0006066982
## ... Similar to previous best
## Run 7 stress 0.0775551
## Run 8 stress 0.07601587
## ... Procrustes: rmse 0.0005422756 max resid 0.008353705
## ... Similar to previous best
## Run 9 stress 0.07938979
## Run 10 stress 0.07600944
## ... Procrustes: rmse 0.0001472 max resid 0.001263037
## ... Similar to previous best
## Run 11 stress 0.07612653
## ... Procrustes: rmse 0.01893282 max resid 0.06235134
## Run 12 stress 0.07612689
## ... Procrustes: rmse 0.01889343 max resid 0.06234841
## Run 13 stress 0.07645635
## ... Procrustes: rmse 0.02106352 max resid 0.1375658
## Run 14 stress 0.07642181
## ... Procrustes: rmse 0.02049783 max resid 0.1369162
## Run 15 stress 0.07680739
## Run 16 stress 0.07710757
## Run 17 stress 0.07611202
## ... Procrustes: rmse 0.01922679 max resid 0.06808445
## Run 18 stress 0.07645897
## ... Procrustes: rmse 0.02106282 max resid 0.1376663
## Run 19 stress 0.07600948
## ... Procrustes: rmse 7.983302e-05 max resid 0.0008935215
## ... Similar to previous best
## Run 20 stress 0.0760095
## ... Procrustes: rmse 4.765778e-05 max resid 0.0004116567
## ... Similar to previous best
## *** Solution reached
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.08082005
## Run 1 stress 0.08157007
## Run 2 stress 0.08082031
## ... Procrustes: rmse 0.0001665894 max resid 0.001951793
## ... Similar to previous best
## Run 3 stress 0.08295016
## Run 4 stress 0.08242176
## Run 5 stress 0.08242095
## Run 6 stress 0.08082058
## ... Procrustes: rmse 0.0001748907 max resid 0.002098015
## ... Similar to previous best
## Run 7 stress 0.08082029
## ... Procrustes: rmse 0.0001740461 max resid 0.002607949
## ... Similar to previous best
## Run 8 stress 0.08082105
## ... Procrustes: rmse 0.0002226545 max resid 0.002531585
## ... Similar to previous best
## Run 9 stress 0.08413661
## Run 10 stress 0.08192203
## Run 11 stress 0.08242066
## Run 12 stress 0.08156996
## Run 13 stress 0.08082032
## ... Procrustes: rmse 0.0001411719 max resid 0.001464453
## ... Similar to previous best
## Run 14 stress 0.08192172
## Run 15 stress 0.0829602
## Run 16 stress 0.08190382
## Run 17 stress 0.08295041
## Run 18 stress 0.08242062
## Run 19 stress 0.08082212
## ... Procrustes: rmse 0.0002551116 max resid 0.003001193
## ... Similar to previous best
## Run 20 stress 0.08157017
## *** Solution reached
We analyzed stroke data using the Kaggle Stroke Prediction Dataset, which contains 12 attributes, 11 being features and the final being the target column of stroke occurrence. We first began by loading and viewing the data.
stroke_data <- read_csv("healthcare-dataset-stroke-data.csv")
head(stroke_data[c(1:10)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi |
|---|---|---|---|---|---|---|---|---|---|
| 9046 | Male | 67 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 |
| 51676 | Female | 61 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | N/A |
| 31112 | Male | 80 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 |
| 60182 | Female | 49 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 |
| 1665 | Female | 79 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24 |
| 56669 | Male | 81 | 0 | 0 | Yes | Private | Urban | 186.21 | 29 |
head(stroke_data[c(11:12)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| smoking_status | stroke |
|---|---|
| formerly smoked | 1 |
| never smoked | 1 |
| never smoked | 1 |
| smokes | 1 |
| never smoked | 1 |
| formerly smoked | 1 |
We then cleaned the data by one-hot encoding the categorical columns and removing rows which contained NA values.
stroke_data$bmi <- as.numeric(stroke_data$bmi)
stroke_data$smoking_status <- na_if(stroke_data$smoking_status, "Unknown")
no_na_stroke_data <- drop_na(stroke_data)
no_na_stroke_data <- select(no_na_stroke_data, -id)
dummies <- dummyVars(" ~ .", data=no_na_stroke_data)
encoded <- data.frame(predict(dummies, newdata=no_na_stroke_data))
Next, we viewed the updated table.
head(encoded[c(1:8)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| genderFemale | genderMale | genderOther | age | hypertension | heart_disease | ever_marriedNo | ever_marriedYes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 67 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 80 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 49 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 79 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 81 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 74 | 1 | 1 | 0 | 1 |
head(encoded[c(9:12)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| work_typechildren | work_typeGovt_job | work_typeNever_worked | work_typePrivate |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 1 |
head(encoded[c(13:18)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| work_typeSelf.employed | Residence_typeRural | Residence_typeUrban | avg_glucose_level | bmi | smoking_statusformerly.smoked |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 228.69 | 36.6 | 1 |
| 0 | 1 | 0 | 105.92 | 32.5 | 0 |
| 0 | 0 | 1 | 171.23 | 34.4 | 0 |
| 1 | 1 | 0 | 174.12 | 24.0 | 0 |
| 0 | 0 | 1 | 186.21 | 29.0 | 1 |
| 0 | 1 | 0 | 70.09 | 27.4 | 0 |
head(encoded[c(19:21)]) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| smoking_statusnever.smoked | smoking_statussmokes | stroke |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 0 | 0 | 1 |
| 1 | 0 | 1 |
We then viewed how the number of observations associated with strokes and not associated with strokes changed after removing rows with NA values.
table(stroke_data$stroke) %>%
kbl(col.names = c("Class", "Frequency")) %>%
kable_styling(c("striped", "hover"))
| Class | Frequency |
|---|---|
| 0 | 4861 |
| 1 | 249 |
table(no_na_stroke_data$stroke) %>%
kbl(col.names = c("Class", "Frequency")) %>%
kable_styling(c("striped", "hover"))
| Class | Frequency |
|---|---|
| 0 | 3246 |
| 1 | 180 |
Noting that the decrease in values was not especially large and affected both classes roughly equally, we were satisfied with this method for processing our data and handling NA values.
summary(no_na_stroke_data)
## gender age hypertension heart_disease
## Length:3426 Min. :10.00 Min. :0.0000 Min. :0.00000
## Class :character 1st Qu.:34.00 1st Qu.:0.0000 1st Qu.:0.00000
## Mode :character Median :50.00 Median :0.0000 Median :0.00000
## Mean :48.65 Mean :0.1191 Mean :0.06013
## 3rd Qu.:63.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :82.00 Max. :1.0000 Max. :1.00000
## ever_married work_type Residence_type avg_glucose_level
## Length:3426 Length:3426 Length:3426 Min. : 55.12
## Class :character Class :character Class :character 1st Qu.: 77.24
## Mode :character Mode :character Mode :character Median : 92.36
## Mean :108.32
## 3rd Qu.:116.21
## Max. :271.74
## bmi smoking_status stroke
## Min. :11.50 Length:3426 Min. :0.00000
## 1st Qu.:25.30 Class :character 1st Qu.:0.00000
## Median :29.10 Mode :character Median :0.00000
## Mean :30.29 Mean :0.05254
## 3rd Qu.:34.10 3rd Qu.:0.00000
## Max. :92.00 Max. :1.00000
ggpairs (no_na_stroke_data,
lower = list(
continuous = "smooth"),
axisLabels ="none",
switch = 'both')
#By viewing the distribution, we can gain a better understanding of which variables may be correlated to each other.
summary(no_na_stroke_data)
## gender age hypertension heart_disease
## Length:3426 Min. :10.00 Min. :0.0000 Min. :0.00000
## Class :character 1st Qu.:34.00 1st Qu.:0.0000 1st Qu.:0.00000
## Mode :character Median :50.00 Median :0.0000 Median :0.00000
## Mean :48.65 Mean :0.1191 Mean :0.06013
## 3rd Qu.:63.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :82.00 Max. :1.0000 Max. :1.00000
## ever_married work_type Residence_type avg_glucose_level
## Length:3426 Length:3426 Length:3426 Min. : 55.12
## Class :character Class :character Class :character 1st Qu.: 77.24
## Mode :character Mode :character Mode :character Median : 92.36
## Mean :108.32
## 3rd Qu.:116.21
## Max. :271.74
## bmi smoking_status stroke
## Min. :11.50 Length:3426 Min. :0.00000
## 1st Qu.:25.30 Class :character 1st Qu.:0.00000
## Median :29.10 Mode :character Median :0.00000
## Mean :30.29 Mean :0.05254
## 3rd Qu.:34.10 3rd Qu.:0.00000
## Max. :92.00 Max. :1.00000
#The mean "average glucose" is at the prediabetes level but the median is less than the prediabetes level. BMI median and mean are at the higher end of the range denoting overweight (25-29.9).
ggpairs (no_na_stroke_data,
lower = list(
continuous = "smooth"),
axisLabels ="none",
switch = 'both')
#Observations:
#Age is correlated to hypertension, heart disease, average glucose, and stroke.
#Heart disease is correlated with average glucose and stroke.
#Hypertension is correlated to heart disease, average glucose, BMI, and stroke.
#I considered splitting the data into age group buckets because of the distribution of the ages. However, it may be disadvantageous to separate the data further because there are few cases of strokes in the datasets.
hist(no_na_stroke_data$stroke, xlab = "0=No Stroke, 1=Stroke", col = 'blue', main = "Stroke Distribution")
no_na_stroke_data %>% count(stroke)
## # A tibble: 2 x 2
## stroke n
## <dbl> <int>
## 1 0 3246
## 2 1 180
#based on dplyr
#There are more cases of not having stroke than having stroke. Therefore, I am using logistic regression to account for this non-normal distribution.
#histogram for total age data
ggplot(data=no_na_stroke_data, aes(age)) +
geom_histogram(aes(y =..density..),
breaks=seq(0, 80, by = 1),
col="black",
fill="blue",
alpha=0.2727) +
geom_density(col=1) +
ggtitle("Total Ages")
#histogram for stroke age data
ggplot(data=no_na_stroke_data, aes(age)) +
geom_histogram(aes(y =..density..),
breaks=seq(0, 80, by = 1),
col="black",
fill="green",
alpha=.2) +
geom_density(col=1) + ylim(0,0.1) +
ggtitle("Ages of Stroke Positive Individuals")
It appears that a broad range of ages were included in this study. As seen in the second histogram, the data is significantly skewed left, meaning that individuals positive for stroke were on average of higher age (around 40 to 80 years old). These findings are consistent with knowledge already known about aging and stroke risk, that being that increased age comes with a significantly elevated risk for stroke development.
#histogram for total average glucose level
ggplot(data=no_na_stroke_data, aes(avg_glucose_level)) +
geom_histogram(aes(y =..density..),
breaks=seq(55, 280, by = 3),
col="black",
fill="pink",
alpha=0.2727) +
geom_density(col=1) +
ggtitle("Total Average Glucose Levels")
#histogram for stroke average glucose level
ggplot(data=no_na_stroke_data, aes(avg_glucose_level)) +
geom_histogram(aes(y =..density..),
breaks=seq(55, 280, by = 3),
col="black",
fill="red",
alpha=.2) +
geom_density(col=1) +
ggtitle("Average Glucose Levels of Stroke Positive Individuals")
From this graph, we can see how although many positive stroke cases presented with normal blood glucose levels, there were also several positive stroke cases that exhibited elevated glucose levels. These elevated levels could be correlated to presence of diabetes, which is a disease known to directly increase stroke risk. However, this variable was not included in the dataset, so this correlation cannot be investigated using this data.
ggplot(data = no_na_stroke_data, aes(x = hypertension)) +
geom_bar(fill="lightblue") +
ggtitle("Hypertension in Stroke Positive Patients")
This figure shows that more stroke positive individuals do not have hypertension, although hypertension is known to increase stroke risk. This could be due to the fact that there is a very small sample size of stroke positive patients. A larger sample size may provide more accurate results.
ggplot(data = no_na_stroke_data, aes(x = heart_disease)) +
geom_bar(fill="orange") +
ggtitle("Heart Disease in Stroke Positive Patients")
The same is found that more stroke positive patients did not have heart disease, even though this variable is correlated to increased stroke risk. Increased sample size may provide different results.
We first split the data into train and test sets, sampling separately between the subset of rows in the dataset which were associated with strokes and those which were not associated with strokes in order to maintain the same proportion of examples in the training and test sets. We used 30% of the full dataset for the test set and the remaining 70% for the training set.
encoded_nostroke <- subset(encoded, stroke == 0)
encoded_stroke <- subset(encoded, stroke == 1)
nostroke_n <- nrow(encoded_nostroke)
stroke_n <- nrow(encoded_stroke)
set.seed(100)
valTestIndex_stroke <- sample(1:stroke_n, size = round(0.3*stroke_n), replace = FALSE)
valTestIndex_nostroke <- sample(1:nostroke_n, size = round(0.3*nostroke_n), replace = FALSE)
train_stroke <- encoded_stroke[-valTestIndex_stroke ,]
train_nostroke <- encoded_nostroke[-valTestIndex_nostroke ,]
test_stroke <- encoded_stroke[valTestIndex_stroke ,]
test_nostroke <- encoded_nostroke[-valTestIndex_nostroke ,]
train <- rbind(train_stroke, train_nostroke)
train_X <- select(train, -stroke)
train_y <- select(train, stroke)
test <- rbind(test_stroke, test_nostroke)
test_X <- select(test, -stroke)
test_y <- select(test, stroke)
We then pre-processed the data by subtracting the mean of each column from each value and dividing by the standard deviation of each column. We fit this transformation on the train set and applied it to both the train and test sets to avoid information leakage between the sets.
preprocessor <- preProcess(train[, c(4, 16, 17)], method=c("center", "scale"))
normalized_train <- predict(preprocessor, train[, c(4, 16, 17)])
normalized_test <- predict(preprocessor, test_X[, c(4, 16, 17)])
train[, c(4, 16, 17)] <- normalized_train
test_X[, c(4, 16, 17)] <- normalized_test
We then trained a simple two-layer neural network to learn to classify inputted vectors of the health data as either being associated with a person who had a stroke or with a person who did not have a stroke. Essentially, a neural network learns values for weights and biases to multiply and add to inputted health information by in order to minimize its loss, or error in making a prediction. We used the cross-entropy loss function for the training process and set a threshold of 0.4 to allow for convergence. The activation function used between each layer was the logistic function, which allowed the model to learn nonlinear relationships between layers. We attempted to use the popular TensorFlow/Keras neural network framework for creating more complex neural network structures, but found it to be too difficult to run in the RStudio Cloud environment given the python backend TensorFlow/Keras uses. We instead used the more simplistic “neuralnet” package in R. Given that the neural network training process is non-deterministic, we set a seed before training the model to ensure reproducibility.
set.seed(123)
model <- neuralnet(stroke~., train, c(5, 5), err.fct = "ce", act.fct = "logistic", linear.output = FALSE, threshold = 0.4)
We can then plot the neural network to better understand the flow of information through the model and to visualize the learned weights and biases following the training process.
plot(model, rep = "best")
We then used the trained model to predict on the test set.
predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
predictions <- as.numeric(predictions)
preds_and_results <- data.frame("True" = test_y, "Predicted" = predictions)
Next, we evaluated the performance of the model on the test set using an array of metrics, however our primary metric used is the balanced accuracy score. This value is the average of the model’s accuracy in correctly classifying inputted vectors with not being associated with strokes as well as in correctly classifying vectors as being associated with strokes. This metric then gives equal weight to correctly classifying the positive and negative class, which is useful in evaluating performance on datasets with significant imbalances between the negative and positive classes.
eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
|---|---|---|---|---|---|---|---|
| 0.536817 | 0.9780739 | 0.1355932 | 0.0740741 | 0.9995599 | 0.8 | 0.9784576 | 0.536817 |
We find reasonable accuracy of the model, yet large discrepancies between the positive predictive value and the negative predictive value. We later attempt to oversample the data using the Synthetic Minority-Oversampling Technique (SMOTE) to address this problem.
We then plotted a confusion matrix to better assess the predictions of the model.
confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)
We find, as suspected, that the model has a bias in classifying inputted health information as not being associated with a stroke given the large imbalance in the dataset, resulting in a significant amount of false negatives in relation to false positives. We address this in the next section
We used the SMOTE algorithm to generate synthetic yet realistic examples in the training set for the minority class in order to help equalize the class imbalance. This algorithm works by drawing lines between nearby values of the minority class in the vector space and by populating those lines with data given the minority class label, thereby generating new, realistic data values.
We begin by selecting the features and labels independently from the training data.
train_X <- select(train, -stroke)
train_y <- select(train, stroke)
We verify the imbalanced class distribution.
class_distribution <- table(train_y)
class_distribution %>%
kbl(col.names = c("Class", "Frequency")) %>%
kable_styling(c("striped", "hover"))
| Class | Frequency |
|---|---|
| 0 | 2272 |
| 1 | 126 |
We then set a seed for reproducibility and use SMOTE to oversample such that the number of examples for the majority and minority classes are roughly equal.
set.seed(100)
oversampled_train <- SMOTE(train_X, train_y, dup_size = class_distribution[1]/class_distribution[2])
We then see that the algorithm succeeds in equalizing the class distributions.
oversampled_train <- oversampled_train$data
oversampled_train$class <- as.numeric(oversampled_train$class)
table(oversampled_train$class) %>%
kbl(col.names = c("Class", "Frequency")) %>%
kable_styling(c("striped", "hover"))
| Class | Frequency |
|---|---|
| 0 | 2272 |
| 1 | 2394 |
We now train a neural network with the same architecture, loss function, and activation function on the oversampled training data.
remove(model)
set.seed(123)
model <- neuralnet(class~., oversampled_train, c(5, 5), err.fct = "ce", act.fct = "logistic", linear.output = FALSE, threshold = 0.8)
We again the plot the model to visualize the structure as well as the weights and biases.
plot(model, rep = "best")
Again we predict on the test set to gain an objective measure of the model’s accuracy.
predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
predictions <- as.numeric(predictions)
preds_and_results <- data.frame("True" = test_y, "Predicted" = predictions)
We again evaluate the performance of the neural network with the same metrics.
eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
|---|---|---|---|---|---|---|---|
| 0.6284559 | 0.8215821 | 0.0997831 | 0.4259259 | 0.8309859 | 0.0565111 | 0.9838458 | 0.6284559 |
We see that the balanced accuracy score has somewhat increased while overall accuracy decreased, suggesting an increase in the accuracy of classifying true positives and a decrease in the accuracy of classifying true negatives.
We again plot a confusion matrix to evaluate the distribution of the model’s classifications in relation to their true values.
confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)
We note, as expected, an uptick in the accuracy of classifying true positives and a drop in the accuracy of classifying true negatives.
Given more time, we may have experimented with other techniques for accounting for class imbalance in our dataset, such as by adjusting our threshold, undersampling, or weighting minority examples more heavily in the training process.
We then used a K-nearest neighbor algorithm on the oversampled dataset to assess accuracy, first selecting the feature and target groups.
over_train_X <- select(oversampled_train, -class)
over_train_y <- select(oversampled_train, class)
We then ran the algorithm with a K value equal to 1. Given more time we would have split our dataset into three sets and performed hyperparameter tuning to determine the optimal value of K for our problem.
predictions <- knn(train = over_train_X, k = 1, test = test_X, cl = over_train_y[, 1])
We then processed the predictions of the knn algorithm.
remove(preds_and_results)
preds_and_results <- tibble(test_y, "Predicted" = predictions)
preds_and_results$Predicted <- as.numeric(as.character(preds_and_results$Predicted))
Finally, we evaluated the performance of the algorithm on our dataset.
eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
|---|---|---|---|---|---|---|---|
| 0.5648148 | 0.9797936 | 0.2295082 | 0.1296296 | 1 | 1 | 0.9797326 | 0.5648148 |
We see that the balanced accuracy score is lower than that of the neural network trained on oversampled data.
We again better evaluate classification accuracy with a confusion matrix.
confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)
We see again our original issue, with a high percentage of true negatives, yet a very low percentage of true positives. Overall, we determine a neural network trained on oversampled data, given the complexity and flexibility of the algorithm, to be the most effective model for correctly classifying people as either having had or not having had strokes of those models yet evaluated.
#The binary logistic regression model is used because the stroke data only has values of 0 and 1 and the distribution is non-normal. This model is simpler and may run faster than the k-nearest neighbor. However, the disadvantage is that the model does not have the neural network’s capability of complex representation learning.
newmodel <- glm(class~., family = binomial, data=oversampled_train)
#The oversampled train data allows the "stroke = 1" observations to be equal to the "stroke = 0" which may improve the accuracy of the model. Class signifies stroke.
summary(newmodel)
##
## Call:
## glm(formula = class ~ ., family = binomial, data = oversampled_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6962 -0.6986 0.3578 0.7448 2.1749
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.335e+13 3.127e+13 -0.427 0.669512
## genderFemale 1.335e+13 3.127e+13 0.427 0.669512
## genderMale 1.335e+13 3.127e+13 0.427 0.669512
## genderOther 1.335e+13 3.127e+13 0.427 0.669512
## age 1.761e+00 6.463e-02 27.244 < 2e-16 ***
## hypertension 8.609e-01 1.027e-01 8.380 < 2e-16 ***
## heart_disease 6.654e-01 1.371e-01 4.855 1.20e-06 ***
## ever_marriedNo -6.045e-01 1.466e-01 -4.122 3.75e-05 ***
## ever_marriedYes NA NA NA NA
## work_typechildren -2.060e+01 5.329e+04 0.000 0.999692
## work_typeGovt_job 5.046e-01 1.356e-01 3.721 0.000199 ***
## work_typeNever_worked -2.113e+01 1.130e+05 0.000 0.999851
## work_typePrivate 8.901e-01 1.018e-01 8.740 < 2e-16 ***
## work_typeSelf.employed NA NA NA NA
## Residence_typeRural -1.165e-01 7.781e-02 -1.497 0.134327
## Residence_typeUrban NA NA NA NA
## avg_glucose_level 4.178e-02 3.603e-02 1.160 0.246241
## bmi 1.007e-01 4.606e-02 2.187 0.028751 *
## smoking_statusformerly.smoked -4.390e-02 1.167e-01 -0.376 0.706830
## smoking_statusnever.smoked -4.204e-02 1.101e-01 -0.382 0.702549
## smoking_statussmokes NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6465.3 on 4665 degrees of freedom
## Residual deviance: 4339.1 on 4649 degrees of freedom
## AIC: 4373.1
##
## Number of Fisher Scoring iterations: 25
#Observations (+) is positive correlation with stroke and (-) is negative correlation with stroke.
#Independent Variables with p<0.01: female (+), age (+), hypertension (+), not married (+), average glucose (+), bmi (+), formerly smoked (-), never smoked (-)
#Independent Variables with 0.01<p<0.05: #government job (-), rural residence (-)
#test_X contains the independent variables and test_Y contains the dependent variable, stroke, of the validation sets.
predictTest <- predict(newmodel, test_X, type = "response")
classification <- ifelse(predictTest > 0.5, 1,0)
#In this case, 0.5 is our threshold.
classification <- as.numeric(classification)
#Prediction Results and Confusion Matrix
preds_and_results <- data.frame("True" = test_y, "Predicted" = classification)
eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
kbl() %>%
kable_styling(c("striped", "hover"))
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC |
|---|---|---|---|---|---|---|---|
| 0.703288 | 0.7205503 | 0.1022099 | 0.6851852 | 0.7213908 | 0.0552239 | 0.9897343 | 0.703288 |
confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)
#1.7% of stroke patients were predicted and 71.2% of non-stroke patients were predicted. The worst inaccuracy would be if a person is predicted to not have a stroke but actually will have a stroke. In this case, 0.6% were incorrectly predicted. The inaccuracy was greater when the model predicted stroke but the person did not have a stroke (26.4%).