Contributions

Heart Disease Subgroup–

Jack worked on Heart Disease UCI Data Preprocessing, Introduction, UCI verses Framingham, and Cholesterol and Heart Rate Regression Graph

Omar contributed to preprocessing of the Framingham data, the code for training and testing sets used to run the linear model for Ten Year CHD predictions, and helped in the completion of the conclusion.

Stroke Subgroup–

Amanda Wade worked on the introduction to the stroke data as well as histograms for various variables.

Aden Eagle preprocessed and oversampled the data with SMOTE and implemented K-nearest neighbors and neural network classifiers.

Jillian Hanley worked on the correlation matrix of the stroke data and the binary logistic regression model.

Introduction

Our project focuses on stroke and heart disease and how different factors may play a role in influencing the cause of these diseases. Indicators such as blood pressure, smoking, cholesterol levels, etc. are notoriously known for being contributing factors to these diseases. But is that necessarily true or is there a more complex answer to this pressing topic? We analyzed the correlation of these variables with the occurrence of heart disease/stroke and employed various statistical tools to ascertain their significance. We also utilized machine learning algorithms to attempt to determine if a given person had a stroke or suffered from heart disease using this health information.

Which Data Set: UCI or Framingham?

Two data sets were initially chosen to represent heart disease, but after looking at the preprocessing of both data sets the Framingham data seemed to be the better choice. The Framingham data had significantly more data to draw results from and included columns like heart rate and smoking which the UCI data lacked.

We first opened and viewed the data.

heart <- read.csv("framingham.csv")

head(heart[c(1:11)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP
1 39 4 0 0 0 0 0 0 195 106.0
0 46 2 0 0 0 0 0 0 250 121.0
1 48 1 1 20 0 0 0 0 245 127.5
0 61 3 1 30 0 0 1 0 225 150.0
0 46 3 1 23 0 0 0 0 285 130.0
0 43 2 0 0 0 0 1 0 228 180.0
head(heart[c(12:16)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
diaBP BMI heartRate glucose TenYearCHD
70 26.97 80 77 0
81 28.73 95 76 0
80 25.34 75 70 0
95 28.58 65 103 1
84 23.10 85 85 0
110 30.30 77 99 0
#Removing cases with N/A values 
no_na_heart <- drop_na(heart)

We then performed a simple regression analysis between cholesterol and heart rate.

ggplot(heart) + geom_point(mapping = aes(x = totChol, y = heartRate, shape = as.factor(TenYearCHD))) + stat_smooth(method = 'lm', aes(x = totChol, y = heartRate), se = FALSE)

This regression illustrates the relationship between cholesterol and heart rate. As seen in the graph, there is a positive correlation, but more interestingly there is the cluster of points which include both those with and without heart disease. This suggests that there’s not one definitive cause to heart disease.

Therefore, we decided to examine the capacity of multiple variables to predict whether a person would develop Coronary Heart Disease in ten years from data collection (TenYearCHD).

#Encoding categorical variables
dummies <- dummyVars(" ~ .", data=no_na_heart)
encoded <- data.frame(predict(dummies, newdata=no_na_heart))

#Splitting into only train and test sets
encoded_n <- nrow(encoded)
valTestIndex <- sample(1:encoded_n, size = round(0.4*encoded_n), replace = FALSE)
train <- encoded[-valTestIndex ,]
test <- encoded[valTestIndex ,]

#Model with training dataset, predicting 10-year CHD
model <- lm(TenYearCHD ~ ., data=train)
summary(model)
## 
## Call:
## lm(formula = TenYearCHD ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67096 -0.19050 -0.11123 -0.02077  1.05504 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.5807970  0.1024672  -5.668 1.64e-08 ***
## male             0.0577204  0.0163672   3.527  0.00043 ***
## age              0.0053622  0.0010031   5.346 9.96e-08 ***
## education       -0.0070035  0.0074558  -0.939  0.34766    
## currentSmoker    0.0280603  0.0236865   1.185  0.23628    
## cigsPerDay       0.0011164  0.0010280   1.086  0.27757    
## BPMeds           0.0606048  0.0464378   1.305  0.19201    
## prevalentStroke  0.1008361  0.0938657   1.074  0.28283    
## prevalentHyp     0.0212887  0.0227128   0.937  0.34871    
## diabetes         0.0004577  0.0563292   0.008  0.99352    
## totChol          0.0003249  0.0001765   1.841  0.06577 .  
## sysBP            0.0029838  0.0006588   4.529 6.24e-06 ***
## diaBP           -0.0014686  0.0010855  -1.353  0.17622    
## BMI              0.0015924  0.0020070   0.793  0.42763    
## heartRate       -0.0007357  0.0006330  -1.162  0.24526    
## glucose          0.0010984  0.0003887   2.826  0.00476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3473 on 2178 degrees of freedom
## Multiple R-squared:  0.09437,    Adjusted R-squared:  0.08813 
## F-statistic: 15.13 on 15 and 2178 DF,  p-value: < 2.2e-16
#Model with test dataset
model <- lm(TenYearCHD ~ ., data=test)
summary(model)
## 
## Call:
## lm(formula = TenYearCHD ~ ., data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63361 -0.18558 -0.09101  0.00582  1.05524 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.5661122  0.1243122  -4.554 5.71e-06 ***
## male             0.0569643  0.0193454   2.945  0.00329 ** 
## age              0.0097106  0.0012042   8.064 1.54e-15 ***
## education       -0.0029227  0.0087468  -0.334  0.73832    
## currentSmoker   -0.0222564  0.0283880  -0.784  0.43316    
## cigsPerDay       0.0038083  0.0012083   3.152  0.00166 ** 
## BPMeds           0.0248305  0.0528585   0.470  0.63860    
## prevalentStroke  0.2119191  0.1278638   1.657  0.09766 .  
## prevalentHyp     0.0345475  0.0274034   1.261  0.20762    
## diabetes         0.0735565  0.0717340   1.025  0.30534    
## totChol         -0.0001723  0.0002116  -0.814  0.41566    
## sysBP            0.0016867  0.0007557   2.232  0.02577 *  
## diaBP           -0.0005829  0.0012465  -0.468  0.64010    
## BMI             -0.0025116  0.0024326  -1.032  0.30201    
## heartRate        0.0001730  0.0007901   0.219  0.82675    
## glucose          0.0011334  0.0004843   2.340  0.01940 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3323 on 1446 degrees of freedom
## Multiple R-squared:  0.1226, Adjusted R-squared:  0.1135 
## F-statistic: 13.47 on 15 and 1446 DF,  p-value: < 2.2e-16

A linear model employing all the variables in the data set yielded weak predictive capability, with a multiple R^2 = 0.1028. Again, this further supports there is great variability in prediction and such a question might necessitate more complex analyses.

#remove incomplete cases
data <- heart %>% na.omit() %>% 
  select(education, age, cigsPerDay, BPMeds, prevalentStroke, prevalentHyp, diabetes, totChol, TenYearCHD, BMI, glucose, sysBP, diaBP)
#View(data)

KNN

KNN works by finding the distances between points in a data set. A k value is set, which determines a set amount of points to analyze surrounding a specific point. The specific point is classified based on the majority of the type of each point within k. K is usually set as the square root of total observations, but due to the nature of this dataset, this was not optimal. Instead, K of 5 was used since it produced the most reliable results. There were 1380 true negative cases and 5 false negative cases. There were also 232 true positives cases and 18 false posiitve cases. As shown by the graphs, there was no clear factor that contributed to an increase risk in Heart Disease.

heart = no_na_heart
num.vars <- sapply(heart, is.numeric)



set.seed(123) 
test <- 2001:3635
train.heart <- heart[-test,]
test.heart <- heart[test,]

train.def <- heart$TenYearCHD[-test]
test.def <- heart$TenYearCHD[test]

preprocessor <- preProcess(train.heart, method=c("center", "scale"))
normalized_train <- predict(preprocessor, train.heart)
normalized_test <- predict(preprocessor, test.heart)
train.heart <- normalized_train
test.heart <- normalized_test

knn.1 <-  knn(train.heart, test.heart, train.def, k =10)

table(knn.1 ,test.def)
##      test.def
## knn.1    0    1
##     0 1380   18
##     1    5  232
plot.df = data.frame(test.heart, predicted = knn.1)
names(plot.df)
##  [1] "male"            "age"             "education"       "currentSmoker"  
##  [5] "cigsPerDay"      "BPMeds"          "prevalentStroke" "prevalentHyp"   
##  [9] "diabetes"        "totChol"         "sysBP"           "diaBP"          
## [13] "BMI"             "heartRate"       "glucose"         "TenYearCHD"     
## [17] "predicted"
plot.df1 = data.frame(x = plot.df$age, 
                      y = plot.df$heartRate, 
                      predicted = plot.df$predicted)


ggplot(plot.df, aes(age, heartRate, color = predicted, fill = predicted,  alpha = predicted, shape = as.factor(prevalentHyp))) + 
  geom_point(size = 3)

ggplot(plot.df, aes(totChol, glucose, color = predicted, fill = predicted,  alpha = predicted, shape = as.factor(diabetes))) + 
  geom_point(size = 3)

ggplot(plot.df, aes(cigsPerDay, heartRate, color = predicted, fill = predicted,  alpha = predicted)) + geom_point(size = 3)

Unsupervised Learning

I hoped to use ordination to explore patterns in the data. I initially used principal componant analysis but found that the differences in scales of the variables made this data difficult to analyze in this way. Some variabals had very large number while others were binary. Non-metric multidimensional scaling is much better at handling data like this, especially since any ordination distance can be used. I tried bothe euclidean an bray-curtis distances and achieved similar results. However, I think the Bray-Curtis dissimilarity handled binary variables better. I excluded data about sex because it did not correlate well with any of the variables and allowed for better clustering.

The dataset was large and made running NMDS on the whole thing difficult. I opted to take 10 random samples of 300 and run the same NMDS analysis on all of them.I grouped according to prevalence of hypertension and found that it generally accounted for most of the individuals.

#Principle component analysis
myrda <- rda(data)

biplot(myrda, 
       display = c("species", "sites"),
       type = c("text", "points"))

#nonmetric multidimensional scaling

#This takes forever to run, but with enough time is probably possible
#heartMDS <- metaMDS(data, distance = "bray", k = 3, trymax = 100)
#stressplot(heartMDS)
#ordiplot(heartMDS)
#plot(envfit(MDS, mysample, permutations = 999))

#solution
for (x in 1:10){
  mysample <- data[sample(1:nrow(data), 300, replace=TRUE),]
  MDS <- metaMDS(mysample, distance = "bray", k = 3, trymax = 100)
  stressplot(MDS)
  ordiplot(MDS)
  plot(envfit(MDS, mysample, permutations = 999))
  ordiellipse(MDS, mysample$prevalentHyp, display = "sites", col = "orange")
}
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.09007612 
## Run 1 stress 0.1073506 
## Run 2 stress 0.08674598 
## ... New best solution
## ... Procrustes: rmse 0.02062612  max resid 0.1424498 
## Run 3 stress 0.0911402 
## Run 4 stress 0.08991935 
## Run 5 stress 0.08759387 
## Run 6 stress 0.1111169 
## Run 7 stress 0.09007544 
## Run 8 stress 0.1094545 
## Run 9 stress 0.08961089 
## Run 10 stress 0.09676924 
## Run 11 stress 0.1210348 
## Run 12 stress 0.09676082 
## Run 13 stress 0.1127997 
## Run 14 stress 0.102722 
## Run 15 stress 0.1136992 
## Run 16 stress 0.08825629 
## Run 17 stress 0.1163332 
## Run 18 stress 0.09074689 
## Run 19 stress 0.112221 
## Run 20 stress 0.1018035 
## Run 21 stress 0.09025851 
## Run 22 stress 0.09115414 
## Run 23 stress 0.08992895 
## Run 24 stress 0.09687533 
## Run 25 stress 0.09784998 
## Run 26 stress 0.09016546 
## Run 27 stress 0.08916215 
## Run 28 stress 0.09013213 
## Run 29 stress 0.1161327 
## Run 30 stress 0.08764699 
## Run 31 stress 0.09007425 
## Run 32 stress 0.09088187 
## Run 33 stress 0.08968621 
## Run 34 stress 0.112364 
## Run 35 stress 0.1113917 
## Run 36 stress 0.1084918 
## Run 37 stress 0.1195211 
## Run 38 stress 0.09700459 
## Run 39 stress 0.09683309 
## Run 40 stress 0.08742865 
## Run 41 stress 0.1089835 
## Run 42 stress 0.1175307 
## Run 43 stress 0.09108971 
## Run 44 stress 0.09118405 
## Run 45 stress 0.09065262 
## Run 46 stress 0.0890751 
## Run 47 stress 0.09014924 
## Run 48 stress 0.0913244 
## Run 49 stress 0.08985897 
## Run 50 stress 0.1200683 
## Run 51 stress 0.09037586 
## Run 52 stress 0.09477593 
## Run 53 stress 0.09115659 
## Run 54 stress 0.1221 
## Run 55 stress 0.1132699 
## Run 56 stress 0.08845211 
## Run 57 stress 0.1096282 
## Run 58 stress 0.1127558 
## Run 59 stress 0.08944135 
## Run 60 stress 0.113441 
## Run 61 stress 0.09618247 
## Run 62 stress 0.08770057 
## Run 63 stress 0.09072484 
## Run 64 stress 0.1182711 
## Run 65 stress 0.08943793 
## Run 66 stress 0.08882119 
## Run 67 stress 0.1005616 
## Run 68 stress 0.1165763 
## Run 69 stress 0.08739737 
## Run 70 stress 0.08942309 
## Run 71 stress 0.09821938 
## Run 72 stress 0.09031389 
## Run 73 stress 0.08682078 
## ... Procrustes: rmse 0.002618574  max resid 0.02539316 
## Run 74 stress 0.09772246 
## Run 75 stress 0.112889 
## Run 76 stress 0.09628904 
## Run 77 stress 0.08825686 
## Run 78 stress 0.09074472 
## Run 79 stress 0.09746435 
## Run 80 stress 0.08771056 
## Run 81 stress 0.09164397 
## Run 82 stress 0.08928741 
## Run 83 stress 0.08968452 
## Run 84 stress 0.09009723 
## Run 85 stress 0.09548798 
## Run 86 stress 0.0898584 
## Run 87 stress 0.09613614 
## Run 88 stress 0.08881778 
## Run 89 stress 0.08739631 
## Run 90 stress 0.1155659 
## Run 91 stress 0.09171939 
## Run 92 stress 0.09651834 
## Run 93 stress 0.09863562 
## Run 94 stress 0.1066123 
## Run 95 stress 0.09074072 
## Run 96 stress 0.1118559 
## Run 97 stress 0.09016561 
## Run 98 stress 0.1076461 
## Run 99 stress 0.09086455 
## Run 100 stress 0.08765776 
## *** No convergence -- monoMDS stopping criteria:
##     17: no. of iterations >= maxit
##     69: stress ratio > sratmax
##     14: scale factor of the gradient < sfgrmin

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07988605 
## Run 1 stress 0.08045534 
## Run 2 stress 0.07986423 
## ... New best solution
## ... Procrustes: rmse 0.000889752  max resid 0.01494961 
## Run 3 stress 0.07988797 
## ... Procrustes: rmse 0.007114556  max resid 0.121257 
## Run 4 stress 0.07988438 
## ... Procrustes: rmse 0.0009411556  max resid 0.01488386 
## Run 5 stress 0.08113374 
## Run 6 stress 0.08062264 
## Run 7 stress 0.08064795 
## Run 8 stress 0.08045551 
## Run 9 stress 0.08194212 
## Run 10 stress 0.08114357 
## Run 11 stress 0.07988437 
## ... Procrustes: rmse 0.0009457758  max resid 0.01490557 
## Run 12 stress 0.07988422 
## ... Procrustes: rmse 0.0009419213  max resid 0.0148885 
## Run 13 stress 0.08045545 
## Run 14 stress 0.0798864 
## ... Procrustes: rmse 0.007123702  max resid 0.1213038 
## Run 15 stress 0.0982179 
## Run 16 stress 0.07996887 
## ... Procrustes: rmse 0.001876861  max resid 0.01507729 
## Run 17 stress 0.08728315 
## Run 18 stress 0.08024228 
## ... Procrustes: rmse 0.01457339  max resid 0.04016334 
## Run 19 stress 0.08045557 
## Run 20 stress 0.07988799 
## ... Procrustes: rmse 0.007115218  max resid 0.121263 
## Run 21 stress 0.0798628 
## ... New best solution
## ... Procrustes: rmse 0.007069691  max resid 0.1212877 
## Run 22 stress 0.08045547 
## Run 23 stress 0.08072861 
## Run 24 stress 0.0806581 
## Run 25 stress 0.08024989 
## ... Procrustes: rmse 0.01619142  max resid 0.1222392 
## Run 26 stress 0.09989734 
## Run 27 stress 0.08045547 
## Run 28 stress 0.08024224 
## ... Procrustes: rmse 0.01618287  max resid 0.1222275 
## Run 29 stress 0.08062659 
## Run 30 stress 0.08071295 
## Run 31 stress 0.07986264 
## ... New best solution
## ... Procrustes: rmse 6.415693e-05  max resid 0.0006803248 
## ... Similar to previous best
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07555323 
## Run 1 stress 0.07839043 
## Run 2 stress 0.07555284 
## ... New best solution
## ... Procrustes: rmse 0.000140649  max resid 0.00191261 
## ... Similar to previous best
## Run 3 stress 0.07912022 
## Run 4 stress 0.07559483 
## ... Procrustes: rmse 0.001328983  max resid 0.01975423 
## Run 5 stress 0.07912023 
## Run 6 stress 0.0790603 
## Run 7 stress 0.07844726 
## Run 8 stress 0.07839059 
## Run 9 stress 0.0755949 
## ... Procrustes: rmse 0.001329374  max resid 0.01971916 
## Run 10 stress 0.07555292 
## ... Procrustes: rmse 3.01317e-05  max resid 0.0003801015 
## ... Similar to previous best
## Run 11 stress 0.07839021 
## Run 12 stress 0.08168856 
## Run 13 stress 0.07555289 
## ... Procrustes: rmse 2.873e-05  max resid 0.0004161298 
## ... Similar to previous best
## Run 14 stress 0.07555311 
## ... Procrustes: rmse 0.0001209194  max resid 0.001869606 
## ... Similar to previous best
## Run 15 stress 0.07559483 
## ... Procrustes: rmse 0.001322217  max resid 0.01989918 
## Run 16 stress 0.07580512 
## ... Procrustes: rmse 0.002025722  max resid 0.0340366 
## Run 17 stress 0.0784211 
## Run 18 stress 0.07839053 
## Run 19 stress 0.07906021 
## Run 20 stress 0.07966937 
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07631092 
## Run 1 stress 0.1023888 
## Run 2 stress 0.07644609 
## ... Procrustes: rmse 0.008136074  max resid 0.1318074 
## Run 3 stress 0.0763005 
## ... New best solution
## ... Procrustes: rmse 0.002844701  max resid 0.01619615 
## Run 4 stress 0.08251427 
## Run 5 stress 0.07682954 
## Run 6 stress 0.1085956 
## Run 7 stress 0.0930297 
## Run 8 stress 0.08324626 
## Run 9 stress 0.07769546 
## Run 10 stress 0.07682951 
## Run 11 stress 0.09494386 
## Run 12 stress 0.09558671 
## Run 13 stress 0.07633401 
## ... Procrustes: rmse 0.002432128  max resid 0.0147198 
## Run 14 stress 0.07682951 
## Run 15 stress 0.08912079 
## Run 16 stress 0.09284035 
## Run 17 stress 0.08849264 
## Run 18 stress 0.08252609 
## Run 19 stress 0.08425479 
## Run 20 stress 0.07769516 
## Run 21 stress 0.09875642 
## Run 22 stress 0.09804926 
## Run 23 stress 0.077489 
## Run 24 stress 0.07825414 
## Run 25 stress 0.08252762 
## Run 26 stress 0.07741071 
## Run 27 stress 0.08909763 
## Run 28 stress 0.09905859 
## Run 29 stress 0.09692296 
## Run 30 stress 0.08256366 
## Run 31 stress 0.07643318 
## ... Procrustes: rmse 0.008232382  max resid 0.1317669 
## Run 32 stress 0.07648734 
## ... Procrustes: rmse 0.008095048  max resid 0.1318021 
## Run 33 stress 0.07734282 
## Run 34 stress 0.07630926 
## ... Procrustes: rmse 0.001333754  max resid 0.01340305 
## Run 35 stress 0.0835959 
## Run 36 stress 0.07633806 
## ... Procrustes: rmse 0.002431014  max resid 0.01514084 
## Run 37 stress 0.0777186 
## Run 38 stress 0.08372271 
## Run 39 stress 0.08358199 
## Run 40 stress 0.09945157 
## Run 41 stress 0.1013817 
## Run 42 stress 0.0887159 
## Run 43 stress 0.08816562 
## Run 44 stress 0.09367683 
## Run 45 stress 0.08831711 
## Run 46 stress 0.08907632 
## Run 47 stress 0.08968519 
## Run 48 stress 0.1059581 
## Run 49 stress 0.07682964 
## Run 50 stress 0.07741062 
## Run 51 stress 0.07643288 
## ... Procrustes: rmse 0.008241621  max resid 0.1317565 
## Run 52 stress 0.07817518 
## Run 53 stress 0.07682964 
## Run 54 stress 0.08463743 
## Run 55 stress 0.0773429 
## Run 56 stress 0.0945852 
## Run 57 stress 0.07817481 
## Run 58 stress 0.08274334 
## Run 59 stress 0.08376885 
## Run 60 stress 0.0827057 
## Run 61 stress 0.09475385 
## Run 62 stress 0.08251025 
## Run 63 stress 0.07734274 
## Run 64 stress 0.08304635 
## Run 65 stress 0.07689318 
## Run 66 stress 0.0839375 
## Run 67 stress 0.07776996 
## Run 68 stress 0.0889937 
## Run 69 stress 0.07741296 
## Run 70 stress 0.07638979 
## ... Procrustes: rmse 0.00177121  max resid 0.01433156 
## Run 71 stress 0.0934456 
## Run 72 stress 0.07741073 
## Run 73 stress 0.08781264 
## Run 74 stress 0.100469 
## Run 75 stress 0.1015082 
## Run 76 stress 0.08321771 
## Run 77 stress 0.08917265 
## Run 78 stress 0.1029513 
## Run 79 stress 0.07827608 
## Run 80 stress 0.09328375 
## Run 81 stress 0.08372603 
## Run 82 stress 0.09383028 
## Run 83 stress 0.09812777 
## Run 84 stress 0.07632774 
## ... Procrustes: rmse 0.0008455678  max resid 0.01428012 
## Run 85 stress 0.08234673 
## Run 86 stress 0.08384494 
## Run 87 stress 0.103826 
## Run 88 stress 0.07641862 
## ... Procrustes: rmse 0.007825386  max resid 0.1318511 
## Run 89 stress 0.09491498 
## Run 90 stress 0.07734271 
## Run 91 stress 0.09126206 
## Run 92 stress 0.1016094 
## Run 93 stress 0.07817461 
## Run 94 stress 0.07682942 
## Run 95 stress 0.08358892 
## Run 96 stress 0.07631069 
## ... Procrustes: rmse 0.00280624  max resid 0.0160503 
## Run 97 stress 0.08400442 
## Run 98 stress 0.08593592 
## Run 99 stress 0.0842892 
## Run 100 stress 0.08748081 
## *** No convergence -- monoMDS stopping criteria:
##     21: no. of iterations >= maxit
##     59: stress ratio > sratmax
##     20: scale factor of the gradient < sfgrmin

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07793496 
## Run 1 stress 0.08345573 
## Run 2 stress 0.07950305 
## Run 3 stress 0.07793498 
## ... Procrustes: rmse 8.656896e-05  max resid 0.00118858 
## ... Similar to previous best
## Run 4 stress 0.08338931 
## Run 5 stress 0.08255836 
## Run 6 stress 0.08246408 
## Run 7 stress 0.08253613 
## Run 8 stress 0.07825041 
## ... Procrustes: rmse 0.007189571  max resid 0.1229027 
## Run 9 stress 0.07793476 
## ... New best solution
## ... Procrustes: rmse 7.0321e-05  max resid 0.000914181 
## ... Similar to previous best
## Run 10 stress 0.0779349 
## ... Procrustes: rmse 5.128571e-05  max resid 0.0006044514 
## ... Similar to previous best
## Run 11 stress 0.07793495 
## ... Procrustes: rmse 9.301436e-05  max resid 0.0008152133 
## ... Similar to previous best
## Run 12 stress 0.08345556 
## Run 13 stress 0.07793488 
## ... Procrustes: rmse 4.74076e-05  max resid 0.0004324236 
## ... Similar to previous best
## Run 14 stress 0.07825371 
## ... Procrustes: rmse 0.007201844  max resid 0.1230166 
## Run 15 stress 0.0794033 
## Run 16 stress 0.07825052 
## ... Procrustes: rmse 0.007188911  max resid 0.1228933 
## Run 17 stress 0.07793497 
## ... Procrustes: rmse 5.537042e-05  max resid 0.0005773593 
## ... Similar to previous best
## Run 18 stress 0.08236959 
## Run 19 stress 0.07964811 
## Run 20 stress 0.07793501 
## ... Procrustes: rmse 9.804391e-05  max resid 0.0009594424 
## ... Similar to previous best
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07410432 
## Run 1 stress 0.07711579 
## Run 2 stress 0.07711901 
## Run 3 stress 0.07361072 
## ... New best solution
## ... Procrustes: rmse 0.008233704  max resid 0.1414918 
## Run 4 stress 0.07409984 
## ... Procrustes: rmse 0.00822779  max resid 0.1414896 
## Run 5 stress 0.07719168 
## Run 6 stress 0.07410053 
## ... Procrustes: rmse 0.008225887  max resid 0.1415367 
## Run 7 stress 0.07715765 
## Run 8 stress 0.07409982 
## ... Procrustes: rmse 0.008231171  max resid 0.1415466 
## Run 9 stress 0.07360977 
## ... New best solution
## ... Procrustes: rmse 0.0002849343  max resid 0.003999329 
## ... Similar to previous best
## Run 10 stress 0.07360991 
## ... Procrustes: rmse 0.0001051132  max resid 0.001238871 
## ... Similar to previous best
## Run 11 stress 0.07410002 
## ... Procrustes: rmse 0.008227229  max resid 0.1415564 
## Run 12 stress 0.07719853 
## Run 13 stress 0.07409995 
## ... Procrustes: rmse 0.008242844  max resid 0.141839 
## Run 14 stress 0.07361075 
## ... Procrustes: rmse 0.0002670841  max resid 0.003997129 
## ... Similar to previous best
## Run 15 stress 0.07409997 
## ... Procrustes: rmse 0.008227051  max resid 0.1415342 
## Run 16 stress 0.07362478 
## ... Procrustes: rmse 0.0005759871  max resid 0.006121655 
## ... Similar to previous best
## Run 17 stress 0.07711067 
## Run 18 stress 0.07361431 
## ... Procrustes: rmse 0.0004719  max resid 0.005248818 
## ... Similar to previous best
## Run 19 stress 0.07361077 
## ... Procrustes: rmse 0.0002639393  max resid 0.003997699 
## ... Similar to previous best
## Run 20 stress 0.07711404 
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07623957 
## Run 1 stress 0.07732742 
## Run 2 stress 0.07840935 
## Run 3 stress 0.07624034 
## ... Procrustes: rmse 0.0003768013  max resid 0.004371012 
## ... Similar to previous best
## Run 4 stress 0.07780323 
## Run 5 stress 0.07647468 
## ... Procrustes: rmse 0.007730161  max resid 0.1257849 
## Run 6 stress 0.07623998 
## ... Procrustes: rmse 0.0001125424  max resid 0.001191911 
## ... Similar to previous best
## Run 7 stress 0.07625234 
## ... Procrustes: rmse 0.001765868  max resid 0.02439018 
## Run 8 stress 0.0762394 
## ... New best solution
## ... Procrustes: rmse 0.0001782712  max resid 0.0027822 
## ... Similar to previous best
## Run 9 stress 0.07835936 
## Run 10 stress 0.0779857 
## Run 11 stress 0.08035475 
## Run 12 stress 0.07835903 
## Run 13 stress 0.07780233 
## Run 14 stress 0.07907699 
## Run 15 stress 0.07625185 
## ... Procrustes: rmse 0.001880307  max resid 0.02677984 
## Run 16 stress 0.07647406 
## ... Procrustes: rmse 0.007391574  max resid 0.1255328 
## Run 17 stress 0.07736039 
## Run 18 stress 0.07647404 
## ... Procrustes: rmse 0.007386053  max resid 0.1255103 
## Run 19 stress 0.07860925 
## Run 20 stress 0.07860893 
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07486869 
## Run 1 stress 0.0840421 
## Run 2 stress 0.07509062 
## ... Procrustes: rmse 0.007326503  max resid 0.124849 
## Run 3 stress 0.07528103 
## ... Procrustes: rmse 0.008212688  max resid 0.1250903 
## Run 4 stress 0.07528102 
## ... Procrustes: rmse 0.008211582  max resid 0.1250938 
## Run 5 stress 0.07511004 
## ... Procrustes: rmse 0.02191143  max resid 0.1355159 
## Run 6 stress 0.07606211 
## Run 7 stress 0.08404329 
## Run 8 stress 0.08749215 
## Run 9 stress 0.07528218 
## ... Procrustes: rmse 0.0208214  max resid 0.1360732 
## Run 10 stress 0.07548026 
## Run 11 stress 0.07511024 
## ... Procrustes: rmse 0.02191647  max resid 0.1355547 
## Run 12 stress 0.075038 
## ... Procrustes: rmse 0.01977501  max resid 0.1359459 
## Run 13 stress 0.08001672 
## Run 14 stress 0.07509046 
## ... Procrustes: rmse 0.007330819  max resid 0.1248913 
## Run 15 stress 0.09126092 
## Run 16 stress 0.07662499 
## Run 17 stress 0.08040065 
## Run 18 stress 0.08797994 
## Run 19 stress 0.08437686 
## Run 20 stress 0.08540266 
## Run 21 stress 0.07622111 
## Run 22 stress 0.07585444 
## Run 23 stress 0.07585457 
## Run 24 stress 0.07520714 
## ... Procrustes: rmse 0.003359117  max resid 0.05437184 
## Run 25 stress 0.07512172 
## ... Procrustes: rmse 0.0178145  max resid 0.04563828 
## Run 26 stress 0.07512169 
## ... Procrustes: rmse 0.01780696  max resid 0.04562928 
## Run 27 stress 0.07990891 
## Run 28 stress 0.07733065 
## Run 29 stress 0.07994523 
## Run 30 stress 0.07511296 
## ... Procrustes: rmse 0.0205105  max resid 0.1360655 
## Run 31 stress 0.08397326 
## Run 32 stress 0.0769825 
## Run 33 stress 0.07673398 
## Run 34 stress 0.08001552 
## Run 35 stress 0.08310472 
## Run 36 stress 0.08404609 
## Run 37 stress 0.07955759 
## Run 38 stress 0.07990881 
## Run 39 stress 0.0748614 
## ... New best solution
## ... Procrustes: rmse 0.0005570582  max resid 0.00874272 
## ... Similar to previous best
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.07612839 
## Run 1 stress 0.07600935 
## ... New best solution
## ... Procrustes: rmse 0.01903209  max resid 0.06201078 
## Run 2 stress 0.07932812 
## Run 3 stress 0.07600948 
## ... Procrustes: rmse 9.157001e-05  max resid 0.001167921 
## ... Similar to previous best
## Run 4 stress 0.07601609 
## ... Procrustes: rmse 0.0005617365  max resid 0.008495579 
## ... Similar to previous best
## Run 5 stress 0.076338 
## ... Procrustes: rmse 0.02004997  max resid 0.07463981 
## Run 6 stress 0.07600931 
## ... New best solution
## ... Procrustes: rmse 5.687208e-05  max resid 0.0006066982 
## ... Similar to previous best
## Run 7 stress 0.0775551 
## Run 8 stress 0.07601587 
## ... Procrustes: rmse 0.0005422756  max resid 0.008353705 
## ... Similar to previous best
## Run 9 stress 0.07938979 
## Run 10 stress 0.07600944 
## ... Procrustes: rmse 0.0001472  max resid 0.001263037 
## ... Similar to previous best
## Run 11 stress 0.07612653 
## ... Procrustes: rmse 0.01893282  max resid 0.06235134 
## Run 12 stress 0.07612689 
## ... Procrustes: rmse 0.01889343  max resid 0.06234841 
## Run 13 stress 0.07645635 
## ... Procrustes: rmse 0.02106352  max resid 0.1375658 
## Run 14 stress 0.07642181 
## ... Procrustes: rmse 0.02049783  max resid 0.1369162 
## Run 15 stress 0.07680739 
## Run 16 stress 0.07710757 
## Run 17 stress 0.07611202 
## ... Procrustes: rmse 0.01922679  max resid 0.06808445 
## Run 18 stress 0.07645897 
## ... Procrustes: rmse 0.02106282  max resid 0.1376663 
## Run 19 stress 0.07600948 
## ... Procrustes: rmse 7.983302e-05  max resid 0.0008935215 
## ... Similar to previous best
## Run 20 stress 0.0760095 
## ... Procrustes: rmse 4.765778e-05  max resid 0.0004116567 
## ... Similar to previous best
## *** Solution reached

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.08082005 
## Run 1 stress 0.08157007 
## Run 2 stress 0.08082031 
## ... Procrustes: rmse 0.0001665894  max resid 0.001951793 
## ... Similar to previous best
## Run 3 stress 0.08295016 
## Run 4 stress 0.08242176 
## Run 5 stress 0.08242095 
## Run 6 stress 0.08082058 
## ... Procrustes: rmse 0.0001748907  max resid 0.002098015 
## ... Similar to previous best
## Run 7 stress 0.08082029 
## ... Procrustes: rmse 0.0001740461  max resid 0.002607949 
## ... Similar to previous best
## Run 8 stress 0.08082105 
## ... Procrustes: rmse 0.0002226545  max resid 0.002531585 
## ... Similar to previous best
## Run 9 stress 0.08413661 
## Run 10 stress 0.08192203 
## Run 11 stress 0.08242066 
## Run 12 stress 0.08156996 
## Run 13 stress 0.08082032 
## ... Procrustes: rmse 0.0001411719  max resid 0.001464453 
## ... Similar to previous best
## Run 14 stress 0.08192172 
## Run 15 stress 0.0829602 
## Run 16 stress 0.08190382 
## Run 17 stress 0.08295041 
## Run 18 stress 0.08242062 
## Run 19 stress 0.08082212 
## ... Procrustes: rmse 0.0002551116  max resid 0.003001193 
## ... Similar to previous best
## Run 20 stress 0.08157017 
## *** Solution reached

Stroke Data

We analyzed stroke data using the Kaggle Stroke Prediction Dataset, which contains 12 attributes, 11 being features and the final being the target column of stroke occurrence. We first began by loading and viewing the data.

stroke_data <- read_csv("healthcare-dataset-stroke-data.csv")
head(stroke_data[c(1:10)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi
9046 Male 67 0 1 Yes Private Urban 228.69 36.6
51676 Female 61 0 0 Yes Self-employed Rural 202.21 N/A
31112 Male 80 0 1 Yes Private Rural 105.92 32.5
60182 Female 49 0 0 Yes Private Urban 171.23 34.4
1665 Female 79 1 0 Yes Self-employed Rural 174.12 24
56669 Male 81 0 0 Yes Private Urban 186.21 29
head(stroke_data[c(11:12)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
smoking_status stroke
formerly smoked 1
never smoked 1
never smoked 1
smokes 1
never smoked 1
formerly smoked 1

We then cleaned the data by one-hot encoding the categorical columns and removing rows which contained NA values.

stroke_data$bmi <- as.numeric(stroke_data$bmi)
stroke_data$smoking_status <- na_if(stroke_data$smoking_status, "Unknown")
no_na_stroke_data <- drop_na(stroke_data)
no_na_stroke_data <- select(no_na_stroke_data, -id)
dummies <- dummyVars(" ~ .", data=no_na_stroke_data)
encoded <- data.frame(predict(dummies, newdata=no_na_stroke_data))

Next, we viewed the updated table.

head(encoded[c(1:8)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
genderFemale genderMale genderOther age hypertension heart_disease ever_marriedNo ever_marriedYes
0 1 0 67 0 1 0 1
0 1 0 80 0 1 0 1
1 0 0 49 0 0 0 1
1 0 0 79 1 0 0 1
0 1 0 81 0 0 0 1
0 1 0 74 1 1 0 1
head(encoded[c(9:12)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
work_typechildren work_typeGovt_job work_typeNever_worked work_typePrivate
0 0 0 1
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 1
0 0 0 1
head(encoded[c(13:18)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
work_typeSelf.employed Residence_typeRural Residence_typeUrban avg_glucose_level bmi smoking_statusformerly.smoked
0 0 1 228.69 36.6 1
0 1 0 105.92 32.5 0
0 0 1 171.23 34.4 0
1 1 0 174.12 24.0 0
0 0 1 186.21 29.0 1
0 1 0 70.09 27.4 0
head(encoded[c(19:21)]) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
smoking_statusnever.smoked smoking_statussmokes stroke
0 0 1
1 0 1
0 1 1
1 0 1
0 0 1
1 0 1

We then viewed how the number of observations associated with strokes and not associated with strokes changed after removing rows with NA values.

table(stroke_data$stroke) %>%
  kbl(col.names = c("Class", "Frequency")) %>%
  kable_styling(c("striped", "hover"))
Class Frequency
0 4861
1 249
table(no_na_stroke_data$stroke) %>%
  kbl(col.names = c("Class", "Frequency")) %>%
  kable_styling(c("striped", "hover"))
Class Frequency
0 3246
1 180

Noting that the decrease in values was not especially large and affected both classes roughly equally, we were satisfied with this method for processing our data and handling NA values.

Distribution and Correlation of Variables

summary(no_na_stroke_data)
##     gender               age         hypertension    heart_disease    
##  Length:3426        Min.   :10.00   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:34.00   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :50.00   Median :0.0000   Median :0.00000  
##                     Mean   :48.65   Mean   :0.1191   Mean   :0.06013  
##                     3rd Qu.:63.00   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                     Max.   :82.00   Max.   :1.0000   Max.   :1.00000  
##  ever_married        work_type         Residence_type     avg_glucose_level
##  Length:3426        Length:3426        Length:3426        Min.   : 55.12   
##  Class :character   Class :character   Class :character   1st Qu.: 77.24   
##  Mode  :character   Mode  :character   Mode  :character   Median : 92.36   
##                                                           Mean   :108.32   
##                                                           3rd Qu.:116.21   
##                                                           Max.   :271.74   
##       bmi        smoking_status         stroke       
##  Min.   :11.50   Length:3426        Min.   :0.00000  
##  1st Qu.:25.30   Class :character   1st Qu.:0.00000  
##  Median :29.10   Mode  :character   Median :0.00000  
##  Mean   :30.29                      Mean   :0.05254  
##  3rd Qu.:34.10                      3rd Qu.:0.00000  
##  Max.   :92.00                      Max.   :1.00000
ggpairs (no_na_stroke_data,
        lower = list(
            continuous = "smooth"),
        axisLabels ="none",
        switch = 'both')

#By viewing the distribution, we can gain a better understanding of which variables may be correlated to each other.
summary(no_na_stroke_data)
##     gender               age         hypertension    heart_disease    
##  Length:3426        Min.   :10.00   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:34.00   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :50.00   Median :0.0000   Median :0.00000  
##                     Mean   :48.65   Mean   :0.1191   Mean   :0.06013  
##                     3rd Qu.:63.00   3rd Qu.:0.0000   3rd Qu.:0.00000  
##                     Max.   :82.00   Max.   :1.0000   Max.   :1.00000  
##  ever_married        work_type         Residence_type     avg_glucose_level
##  Length:3426        Length:3426        Length:3426        Min.   : 55.12   
##  Class :character   Class :character   Class :character   1st Qu.: 77.24   
##  Mode  :character   Mode  :character   Mode  :character   Median : 92.36   
##                                                           Mean   :108.32   
##                                                           3rd Qu.:116.21   
##                                                           Max.   :271.74   
##       bmi        smoking_status         stroke       
##  Min.   :11.50   Length:3426        Min.   :0.00000  
##  1st Qu.:25.30   Class :character   1st Qu.:0.00000  
##  Median :29.10   Mode  :character   Median :0.00000  
##  Mean   :30.29                      Mean   :0.05254  
##  3rd Qu.:34.10                      3rd Qu.:0.00000  
##  Max.   :92.00                      Max.   :1.00000
#The mean "average glucose" is at the prediabetes level but the median is less than the prediabetes level. BMI median and mean are at the higher end of the range denoting overweight (25-29.9). 
ggpairs (no_na_stroke_data,
        lower = list(
            continuous = "smooth"),
        axisLabels ="none",
        switch = 'both')

#Observations:
#Age is correlated to hypertension, heart disease, average glucose, and stroke.
#Heart disease is correlated with average glucose and stroke.
#Hypertension is correlated to heart disease, average glucose, BMI, and stroke.
#I considered splitting the data into age group buckets because of the distribution of the ages. However, it may be disadvantageous to separate the data further because there are few cases of strokes in the datasets. 
hist(no_na_stroke_data$stroke, xlab = "0=No Stroke, 1=Stroke", col = 'blue', main = "Stroke Distribution")

no_na_stroke_data %>% count(stroke)
## # A tibble: 2 x 2
##   stroke     n
##    <dbl> <int>
## 1      0  3246
## 2      1   180
#based on dplyr
#There are more cases of not having stroke than having stroke. Therefore, I am using logistic regression to account for this non-normal distribution. 

Histograms: Total Ages and Stroke Positive Ages

#histogram for total age data
ggplot(data=no_na_stroke_data, aes(age)) + 
  geom_histogram(aes(y =..density..), 
                 breaks=seq(0, 80, by = 1), 
                 col="black", 
                 fill="blue", 
                 alpha=0.2727) + 
  geom_density(col=1) +
  ggtitle("Total Ages")

#histogram for stroke age data
ggplot(data=no_na_stroke_data, aes(age)) + 
  geom_histogram(aes(y =..density..), 
                 breaks=seq(0, 80, by = 1), 
                 col="black", 
                 fill="green", 
                 alpha=.2) + 
  geom_density(col=1) + ylim(0,0.1) +
  ggtitle("Ages of Stroke Positive Individuals")

It appears that a broad range of ages were included in this study. As seen in the second histogram, the data is significantly skewed left, meaning that individuals positive for stroke were on average of higher age (around 40 to 80 years old). These findings are consistent with knowledge already known about aging and stroke risk, that being that increased age comes with a significantly elevated risk for stroke development.

Histograms: Average Total and Stroke Positive Glucose Levels

#histogram for total average glucose level
ggplot(data=no_na_stroke_data, aes(avg_glucose_level)) + 
  geom_histogram(aes(y =..density..), 
                 breaks=seq(55, 280, by = 3), 
                 col="black", 
                 fill="pink", 
                 alpha=0.2727) + 
  geom_density(col=1) +
  ggtitle("Total Average Glucose Levels")

#histogram for stroke average glucose level
ggplot(data=no_na_stroke_data, aes(avg_glucose_level)) + 
  geom_histogram(aes(y =..density..), 
                 breaks=seq(55, 280, by = 3), 
                 col="black", 
                 fill="red", 
                 alpha=.2) + 
  geom_density(col=1) +
  ggtitle("Average Glucose Levels of Stroke Positive Individuals")

From this graph, we can see how although many positive stroke cases presented with normal blood glucose levels, there were also several positive stroke cases that exhibited elevated glucose levels. These elevated levels could be correlated to presence of diabetes, which is a disease known to directly increase stroke risk. However, this variable was not included in the dataset, so this correlation cannot be investigated using this data.

Histogram: Hypertension and Stroke Positive Individuals

  ggplot(data = no_na_stroke_data, aes(x = hypertension)) + 
  geom_bar(fill="lightblue") +
  ggtitle("Hypertension in Stroke Positive Patients")

This figure shows that more stroke positive individuals do not have hypertension, although hypertension is known to increase stroke risk. This could be due to the fact that there is a very small sample size of stroke positive patients. A larger sample size may provide more accurate results.

Histogram: Heart Disease and Stroke Positive Individuals

ggplot(data = no_na_stroke_data, aes(x = heart_disease)) + 
  geom_bar(fill="orange") +
  ggtitle("Heart Disease in Stroke Positive Patients")

The same is found that more stroke positive patients did not have heart disease, even though this variable is correlated to increased stroke risk. Increased sample size may provide different results.

Neural Network and K-Nearest Neighbor

We first split the data into train and test sets, sampling separately between the subset of rows in the dataset which were associated with strokes and those which were not associated with strokes in order to maintain the same proportion of examples in the training and test sets. We used 30% of the full dataset for the test set and the remaining 70% for the training set.

encoded_nostroke <- subset(encoded, stroke == 0)
encoded_stroke <- subset(encoded, stroke == 1)
nostroke_n <- nrow(encoded_nostroke)
stroke_n <- nrow(encoded_stroke)
set.seed(100)
valTestIndex_stroke <- sample(1:stroke_n, size = round(0.3*stroke_n), replace = FALSE)
valTestIndex_nostroke <- sample(1:nostroke_n, size = round(0.3*nostroke_n), replace = FALSE)
train_stroke <- encoded_stroke[-valTestIndex_stroke ,]
train_nostroke <- encoded_nostroke[-valTestIndex_nostroke ,]
test_stroke <- encoded_stroke[valTestIndex_stroke ,]
test_nostroke <- encoded_nostroke[-valTestIndex_nostroke ,]
train <- rbind(train_stroke, train_nostroke)
train_X <- select(train, -stroke)
train_y <- select(train, stroke)
test <- rbind(test_stroke, test_nostroke)
test_X <- select(test, -stroke)
test_y <- select(test, stroke)

We then pre-processed the data by subtracting the mean of each column from each value and dividing by the standard deviation of each column. We fit this transformation on the train set and applied it to both the train and test sets to avoid information leakage between the sets.

preprocessor <- preProcess(train[, c(4, 16, 17)], method=c("center", "scale"))
normalized_train <- predict(preprocessor, train[, c(4, 16, 17)])
normalized_test <- predict(preprocessor, test_X[, c(4, 16, 17)])
train[, c(4, 16, 17)] <- normalized_train
test_X[, c(4, 16, 17)] <- normalized_test

We then trained a simple two-layer neural network to learn to classify inputted vectors of the health data as either being associated with a person who had a stroke or with a person who did not have a stroke. Essentially, a neural network learns values for weights and biases to multiply and add to inputted health information by in order to minimize its loss, or error in making a prediction. We used the cross-entropy loss function for the training process and set a threshold of 0.4 to allow for convergence. The activation function used between each layer was the logistic function, which allowed the model to learn nonlinear relationships between layers. We attempted to use the popular TensorFlow/Keras neural network framework for creating more complex neural network structures, but found it to be too difficult to run in the RStudio Cloud environment given the python backend TensorFlow/Keras uses. We instead used the more simplistic “neuralnet” package in R. Given that the neural network training process is non-deterministic, we set a seed before training the model to ensure reproducibility.

set.seed(123)
model <- neuralnet(stroke~., train, c(5, 5), err.fct = "ce", act.fct = "logistic", linear.output = FALSE, threshold = 0.4)

We can then plot the neural network to better understand the flow of information through the model and to visualize the learned weights and biases following the training process.

plot(model, rep = "best")

We then used the trained model to predict on the test set.

predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
predictions <- as.numeric(predictions)
preds_and_results <- data.frame("True" = test_y, "Predicted" = predictions)

Next, we evaluated the performance of the model on the test set using an array of metrics, however our primary metric used is the balanced accuracy score. This value is the average of the model’s accuracy in correctly classifying inputted vectors with not being associated with strokes as well as in correctly classifying vectors as being associated with strokes. This metric then gives equal weight to correctly classifying the positive and negative class, which is useful in evaluating performance on datasets with significant imbalances between the negative and positive classes.

eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
Balanced Accuracy Accuracy F1 Sensitivity Specificity Pos Pred Value Neg Pred Value AUC
0.536817 0.9780739 0.1355932 0.0740741 0.9995599 0.8 0.9784576 0.536817

We find reasonable accuracy of the model, yet large discrepancies between the positive predictive value and the negative predictive value. We later attempt to oversample the data using the Synthetic Minority-Oversampling Technique (SMOTE) to address this problem.

We then plotted a confusion matrix to better assess the predictions of the model.

confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)

We find, as suspected, that the model has a bias in classifying inputted health information as not being associated with a stroke given the large imbalance in the dataset, resulting in a significant amount of false negatives in relation to false positives. We address this in the next section

Oversampling With SMOTE

We used the SMOTE algorithm to generate synthetic yet realistic examples in the training set for the minority class in order to help equalize the class imbalance. This algorithm works by drawing lines between nearby values of the minority class in the vector space and by populating those lines with data given the minority class label, thereby generating new, realistic data values.

We begin by selecting the features and labels independently from the training data.

train_X <- select(train, -stroke)
train_y <- select(train, stroke)

We verify the imbalanced class distribution.

class_distribution <- table(train_y)
class_distribution %>%
  kbl(col.names = c("Class", "Frequency")) %>%
  kable_styling(c("striped", "hover"))
Class Frequency
0 2272
1 126

We then set a seed for reproducibility and use SMOTE to oversample such that the number of examples for the majority and minority classes are roughly equal.

set.seed(100)
oversampled_train <- SMOTE(train_X, train_y, dup_size = class_distribution[1]/class_distribution[2])

We then see that the algorithm succeeds in equalizing the class distributions.

oversampled_train <- oversampled_train$data
oversampled_train$class <- as.numeric(oversampled_train$class)
table(oversampled_train$class) %>%
  kbl(col.names = c("Class", "Frequency")) %>%
  kable_styling(c("striped", "hover"))
Class Frequency
0 2272
1 2394

We now train a neural network with the same architecture, loss function, and activation function on the oversampled training data.

remove(model)
set.seed(123)
model <- neuralnet(class~., oversampled_train, c(5, 5), err.fct = "ce", act.fct = "logistic", linear.output = FALSE, threshold = 0.8)

We again the plot the model to visualize the structure as well as the weights and biases.

plot(model, rep = "best")

Again we predict on the test set to gain an objective measure of the model’s accuracy.

predictions <- neuralnet::compute(model, test_X)
predictions <- ifelse(predictions$net.result > 0.5, 1, 0)
predictions <- as.numeric(predictions)
preds_and_results <- data.frame("True" = test_y, "Predicted" = predictions)

We again evaluate the performance of the neural network with the same metrics.

eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
Balanced Accuracy Accuracy F1 Sensitivity Specificity Pos Pred Value Neg Pred Value AUC
0.6284559 0.8215821 0.0997831 0.4259259 0.8309859 0.0565111 0.9838458 0.6284559

We see that the balanced accuracy score has somewhat increased while overall accuracy decreased, suggesting an increase in the accuracy of classifying true positives and a decrease in the accuracy of classifying true negatives.

We again plot a confusion matrix to evaluate the distribution of the model’s classifications in relation to their true values.

confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)

We note, as expected, an uptick in the accuracy of classifying true positives and a drop in the accuracy of classifying true negatives.

Given more time, we may have experimented with other techniques for accounting for class imbalance in our dataset, such as by adjusting our threshold, undersampling, or weighting minority examples more heavily in the training process.

K-Nearest Neighbor

We then used a K-nearest neighbor algorithm on the oversampled dataset to assess accuracy, first selecting the feature and target groups.

over_train_X <- select(oversampled_train, -class)
over_train_y <- select(oversampled_train, class)

We then ran the algorithm with a K value equal to 1. Given more time we would have split our dataset into three sets and performed hyperparameter tuning to determine the optimal value of K for our problem.

predictions <- knn(train = over_train_X, k = 1, test = test_X, cl = over_train_y[, 1])

We then processed the predictions of the knn algorithm.

remove(preds_and_results)
preds_and_results <- tibble(test_y, "Predicted" = predictions)
preds_and_results$Predicted <- as.numeric(as.character(preds_and_results$Predicted))

Finally, we evaluated the performance of the algorithm on our dataset.

eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
Balanced Accuracy Accuracy F1 Sensitivity Specificity Pos Pred Value Neg Pred Value AUC
0.5648148 0.9797936 0.2295082 0.1296296 1 1 0.9797326 0.5648148

We see that the balanced accuracy score is lower than that of the neural network trained on oversampled data.

We again better evaluate classification accuracy with a confusion matrix.

confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)

We see again our original issue, with a high percentage of true negatives, yet a very low percentage of true positives. Overall, we determine a neural network trained on oversampled data, given the complexity and flexibility of the algorithm, to be the most effective model for correctly classifying people as either having had or not having had strokes of those models yet evaluated.

Binary Logistic Regression

#The binary logistic regression model is used because the stroke data only has values of 0 and 1 and the distribution is non-normal. This model is simpler and may run faster than the k-nearest neighbor. However, the disadvantage is that the model does not have the neural network’s capability of complex representation learning.
newmodel <- glm(class~., family = binomial, data=oversampled_train)
#The oversampled train data allows the "stroke = 1" observations to be equal to the "stroke = 0" which may improve the accuracy of the model. Class signifies stroke.
summary(newmodel)
## 
## Call:
## glm(formula = class ~ ., family = binomial, data = oversampled_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6962  -0.6986   0.3578   0.7448   2.1749  
## 
## Coefficients: (4 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.335e+13  3.127e+13  -0.427 0.669512    
## genderFemale                   1.335e+13  3.127e+13   0.427 0.669512    
## genderMale                     1.335e+13  3.127e+13   0.427 0.669512    
## genderOther                    1.335e+13  3.127e+13   0.427 0.669512    
## age                            1.761e+00  6.463e-02  27.244  < 2e-16 ***
## hypertension                   8.609e-01  1.027e-01   8.380  < 2e-16 ***
## heart_disease                  6.654e-01  1.371e-01   4.855 1.20e-06 ***
## ever_marriedNo                -6.045e-01  1.466e-01  -4.122 3.75e-05 ***
## ever_marriedYes                       NA         NA      NA       NA    
## work_typechildren             -2.060e+01  5.329e+04   0.000 0.999692    
## work_typeGovt_job              5.046e-01  1.356e-01   3.721 0.000199 ***
## work_typeNever_worked         -2.113e+01  1.130e+05   0.000 0.999851    
## work_typePrivate               8.901e-01  1.018e-01   8.740  < 2e-16 ***
## work_typeSelf.employed                NA         NA      NA       NA    
## Residence_typeRural           -1.165e-01  7.781e-02  -1.497 0.134327    
## Residence_typeUrban                   NA         NA      NA       NA    
## avg_glucose_level              4.178e-02  3.603e-02   1.160 0.246241    
## bmi                            1.007e-01  4.606e-02   2.187 0.028751 *  
## smoking_statusformerly.smoked -4.390e-02  1.167e-01  -0.376 0.706830    
## smoking_statusnever.smoked    -4.204e-02  1.101e-01  -0.382 0.702549    
## smoking_statussmokes                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6465.3  on 4665  degrees of freedom
## Residual deviance: 4339.1  on 4649  degrees of freedom
## AIC: 4373.1
## 
## Number of Fisher Scoring iterations: 25
#Observations (+) is positive correlation with stroke and (-) is negative correlation with stroke.
#Independent Variables with p<0.01: female (+), age (+), hypertension (+), not married (+), average glucose (+), bmi (+), formerly smoked (-), never smoked (-)
#Independent Variables with 0.01<p<0.05: #government job (-), rural residence (-)

Prediction Test

#test_X contains the independent variables and test_Y contains the dependent variable, stroke, of the validation sets.
predictTest <- predict(newmodel, test_X, type = "response")
classification <- ifelse(predictTest > 0.5, 1,0)
#In this case, 0.5 is our threshold. 
classification <- as.numeric(classification)
#Prediction Results and Confusion Matrix
preds_and_results <- data.frame("True" = test_y, "Predicted" = classification)
eval <- evaluate(preds_and_results, target_col = "stroke", prediction_cols = "Predicted", type = "binomial")
numeric_performance <- eval[, c(1:8)]
head(numeric_performance) %>%
  kbl() %>%
  kable_styling(c("striped", "hover"))
Balanced Accuracy Accuracy F1 Sensitivity Specificity Pos Pred Value Neg Pred Value AUC
0.703288 0.7205503 0.1022099 0.6851852 0.7213908 0.0552239 0.9897343 0.703288
confusion_matrix <- eval$"Confusion Matrix"[[1]]
plot_confusion_matrix(confusion_matrix)

#1.7% of stroke patients were predicted and 71.2% of non-stroke patients were predicted. The worst inaccuracy would be if a person is predicted to not have a stroke but actually will have a stroke. In this case, 0.6% were incorrectly predicted. The inaccuracy was greater when the model predicted stroke but the person did not have a stroke (26.4%).

Conclusion

Overall, we observed that variables such as gender, age, hypertension, average glucose, BMI, not smoking, and not being married all greatly influenced our model (p<0.01). This indicates that predicting stroke is difficult because of the large number of variables in a person’s life. Some barriers that we encountered was that lack of data of patients with strokes which influenced the accuracy of the models. We overcame this issue by oversampling the stroke data and then separating it into train and test sets. Similarly, there are many risk factors contributing to the development of heart disease, which resulted in weak predictions with simpler models employed here and call for more complex analyses.