To become more familiar with evaluation metrics, we decided to tune a kNN classifier on two datasets: one aggregating heart failure metrics and one aggregating Australian weather metrics. From these datasets, we isolated two questions of interest to explore.
These exploratory questions will guide a comprehensive analysis of two end-to-end models.
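Throughout, the code assumes a setup chunk (not shown here) has already loaded the packages used in the analysis; a minimal sketch of that setup would be:
library(tidyverse) # read_csv(), dplyr verbs, ggplot2, tibble
library(class) # knn()
library(caret) # confusionMatrix(), createDataPartition(), trainControl(), train()
library(ggcorrplot) # ggcorrplot()
library(MLmetrics) # LogLoss(), F1_Score()
library(ROCR) # prediction(), performance()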
Preprocessing the heart dataset requires reading in the data, verifying that it is clean, removing duplicate rows, and scaling the features:
heart_df<-read_csv("heart.csv")
na_tf<-sum(is.na(heart_df)) # verify the data is clean
dup<-table(duplicated(heart_df)) # check for duplicates
heart_df<-heart_df%>%distinct() # since there was 1 duplicate, we create a new dataset from the unique rows
heart_df[, 1:(length(colnames(heart_df))-1)] <- lapply(heart_df[, 1:(length(colnames(heart_df))-1)],function(x) scale(x))
The base rate is the probability of correctly identifying an individual at greater risk of heart failure with no observational information. Calculated from the split between 1 (greater risk of heart attack, 54.3%) and 0 (lower risk of heart attack, 45.7%), this value indicates that always guessing the majority class gives us a 54.3% chance of correctly identifying whether an individual is at greater risk of heart attack.
table(heart_df$target)[2]/sum(table(heart_df$target))
## 1
## 0.5430464
# the base rate is roughly 164/302, indicating that we have a 54.3% chance of correctly identifying an individual with a greater risk of heart failure when guessing with no information about the observation
Since cp, restecg, and thal are categorical attributes with no intrinsic order (and are not one-hot encoded here), we pare down the dataset by dropping those columns.
heart_simplified<-heart_df%>%select(-c("cp","restecg", "thal"))
Highly correlated variables signify data redundancy, a principal danger of which is artificial weight inflation and overfitting. The correlations among our features can be inspected with a correlation map:
cor_matrix<-cor(heart_simplified%>%select(-("target")))
ggcorrplot(cor_matrix)
Highly correlated features can be defined as those with a correlation above 0.7 or below -0.7, or those correlating significantly with multiple variables; exploratory analysis of the correlation matrix shows that no variable needs to be dropped (the sketch below cross-checks this programmatically). Reduced dimensionality allows the information to be encoded without data redundancy; our dataset, heart_simplified, can now be run through an initial kNN model.
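As a cross-check on the visual inspection, caret's findCorrelation() flags columns to remove so that no pairwise correlation exceeds a cutoff; given the matrix above, it should return nothing here.
findCorrelation(cor_matrix, cutoff = 0.7, names = TRUE) # columns caret would drop at |r| > 0.7; expected to be empty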
Splitting our data into train and test sets allows us to validate the performance of the trained model. To verify the 80-20 data split, we divide the number of observations in the training set by the total number of observations in the original set.
#6. Use the index to generate a train and test sets, then check the row counts to be safe.
set.seed(1999)
heart_train_rows = sample(1:nrow(heart_simplified),#<- from 1 to the number of #rows in the data set
round(0.8 * nrow(heart_simplified), 0), #<- multiply the number of rows by 0.8 and round
replace = FALSE)#<- don't replace the numbers
length(heart_train_rows) / nrow(heart_simplified)
## [1] 0.8013245
heart_train = heart_simplified[heart_train_rows, ]
heart_test = heart_simplified[-heart_train_rows, ]
We can run an initial model using k=3; from this, we can begin to analyze whether the model beats our base rate and how its performance can be tuned.
#7 Train the classifier using k = 3, remember to set.seed so you can repeat the output and to use the labels as a vector for the class (not an index of the dataframe)
col_n<-colnames(heart_simplified)
set.seed(1999)
heart_3NN <- knn(train = heart_train[, (col_n[1:(length(col_n)-1)])],#<- training set cases
test = heart_test[, (col_n[1:(length(col_n)-1)])], #<- test set cases
cl = heart_train$target,#<- category for true classification
k = 3,#<- number of neighbors considered
use.all = TRUE,
prob = TRUE)
#9 Run the confusion matrix function and comment on the model output
confusionMatrix(as.factor(heart_3NN), as.factor(heart_test$target), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 20 5
## 1 8 27
##
## Accuracy : 0.7833
## 95% CI : (0.658, 0.8793)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 5.405e-05
##
## Kappa : 0.5618
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8438
## Specificity : 0.7143
## Pos Pred Value : 0.7714
## Neg Pred Value : 0.8000
## Prevalence : 0.5333
## Detection Rate : 0.4500
## Detection Prevalence : 0.5833
## Balanced Accuracy : 0.7790
##
## 'Positive' Class : 1
##
The confusionMatrix function allows us to analyze success metrics like sensitivity and specificity; our initial kNN model where k=3 has an accuracy of 78.33%, a sensitivity of 84.38%, and a specificity of 71.43%.
To improve sensitivity and accuracy, we must tune our hyperparameter k and isolate the value that returns the best model performance without overfitting. To do this, we can build kNN models that test values of k from 1 to 21.
#10 Run the chooseK() function, using sapply() to test k from 1 to 21 (odd values only), with the heart training and test sets and their target columns as arguments; label the result knn_different_k
chooseK<-function(k, train_set, val_set, train_class, val_class){
# Build knn with k neighbors considered.
set.seed(1999)
class_knn<-knn(train = train_set, #<- training set cases
test = val_set, #<- test set cases
cl = train_class, #<- category for classification
k = k, #<- number of neighbors considered
use.all = TRUE) #<- control ties between class assignments
conf_mat = table(class_knn, val_class)
#accu = sum(conf_mat[row(conf_mat) == col(conf_mat)]) / sum(conf_mat)
sen=conf_mat[2,2]/sum(conf_mat[,2])
cbind(k = k, sensitivity = sen)
}
knn_different_k<-sapply(seq(1, 21, by = 2), #<- set k to be odd number from 1 to 21
function(x) chooseK(x,
train_set = heart_train[, (col_n[1:(length(col_n)-1)])],
val_set = heart_test[, (col_n[1:(length(col_n)-1)])],
train_class = heart_train$target,
val_class = heart_test$target))
To visualize the kNN models against their sensitivity, we convert the matrix to a dataframe.
#11 Create a dataframe so we can visualize the difference in accuracy based on K, convert the matrix to a dataframe
knn_different_k<-tibble(k = knn_different_k[1,],
sensitivity = knn_different_k[2,])
To isolate the ideal value of k, we can graph k against model sensitivity:
#12 Use ggplot to show the output and comment on the k to select.
ggplot(knn_different_k,
aes(x = k, y = sensitivity)) +
geom_line(color = "pink", size = 1.5) +
geom_point(size = 3)
At k=7 the curve shows diminishing returns, suggesting that k=7 is the value that balances high sensitivity against a low risk of overfitting.
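For reference, a similar choice could be made programmatically from knn_different_k; the 0.01 tolerance below is our own arbitrary heuristic, not the selection rule used above.
# smallest k whose sensitivity is within 0.01 of the best value observed across the grid
knn_different_k %>%
filter(sensitivity >= max(sensitivity) - 0.01) %>%
pull(k) %>%
min()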
After isolating the ideal number of neighbors to be considered, we can run our model with its tuned parameters.
#13 Rerun the model with the k you selected, assuming it's different.
set.seed(1999)
heart_7NN<-knn(train = heart_train[, (col_n[1:(length(col_n)-1)])],#<- training set cases
test = heart_test[, (col_n[1:(length(col_n)-1)])], #<- test set cases
cl = heart_train$target,#<- category for true classification
k = 7,#<- number of neighbors considered
use.all = TRUE,
prob = TRUE)
#14 Use the confusion matrix function to measure the quality of the new model.
confusionMatrix(as.factor(heart_7NN), as.factor(heart_test$target), positive = "1", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 20 6
## 1 8 26
##
## Accuracy : 0.7667
## 95% CI : (0.6396, 0.8662)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.0001655
##
## Kappa : 0.5291
##
## Mcnemar's Test P-Value : 0.7892680
##
## Sensitivity : 0.8125
## Specificity : 0.7143
## Pos Pred Value : 0.7647
## Neg Pred Value : 0.7692
## Prevalence : 0.5333
## Detection Rate : 0.4333
## Detection Prevalence : 0.5667
## Balanced Accuracy : 0.7634
##
## 'Positive' Class : 1
##
Our initial kNN model where k=3 has an accuracy of 78.33%, a sensitivity of 84.38%, and a specificity of 71.43%; in comparison, our tuned model with k=7 has an accuracy of 76.67%, a sensitivity of 81.25%, and the same specificity of 71.43%. The tuned model gives up a small amount of test-set accuracy and sensitivity in exchange for a larger neighborhood that is less prone to overfitting, and its kappa of 0.53 suggests it is performing moderately well.
The cost of misclassification is higher for a type 2 error (false negative) than for a type 1 error (false positive): we would rather flag some lower-risk individuals as higher risk than miss individuals who are truly at greater risk of heart attack. As such, the rate of false negatives (classifying an individual at high risk of heart failure as one at low risk) is critical to the success of the model.
The rate of false negatives is the complement of our sensitivity (true positive rate):
fnr<-1-0.8125
fnr
## [1] 0.1875
The rate of false positives is the complement of our specificity (true negative rate):
fpr<-1-0.7143
fpr
## [1] 0.2857
Since our model was tuned with type 2 errors in mind, it realizes a lower rate of false negatives than false positives. The context of the dataset allows us to prioritize sensitivity as an evaluation metric.
To maximize our sensitivity, we can analyze the decision threshold of our model as a dynamic parameter. First, we can provide some base measures to evaluate the balance of our model (log loss and F1).
Log-loss measures how far the prediction probabilities are from the actual values; the further the log-loss is above 0, the more disparate the prediction probabilities are from the targets.
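For reference, binary log-loss can be written out directly; this is a minimal sketch in the spirit of MLmetrics::LogLoss, where the clipping constant eps is our own choice to avoid taking log(0).
log_loss <- function(p, y, eps = 1e-15) {
p <- pmin(pmax(p, eps), 1 - eps) # keep probabilities strictly inside (0, 1)
-mean(y * log(p) + (1 - y) * log(1 - p)) # average negative log-likelihood
}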
prob_knn <- tibble(heart_7NN, attributes(heart_7NN)$prob)
prob_knn$prob <- if_else(prob_knn$heart_7NN == 0,
1-prob_knn$`attributes(heart_7NN)$prob`, prob_knn$`attributes(heart_7NN)$prob`)
heart_eval<-data.frame(pred_class = heart_7NN,
pred_prob = prob_knn$prob,
target = as.numeric(heart_test$target))
paste("log-loss=", LogLoss(as.numeric(heart_eval$pred_prob), as.numeric(heart_test$target)))
## [1] "log-loss= 1.54526715953857"
This relatively high log-loss suggests that the model's prediction probabilities sit far from the target values, whether through misclassification or low model confidence.
F1 is a measure of the harmonic balance between precision and recall; an F1 score of 1 is considered perfect.
paste("f1=", F1_Score(as.numeric(heart_eval$pred_prob), as.numeric(heart_test$target)))
## [1] "f1= 0.303030303030303"
Our F1 score of roughly 0.79 indicates a reasonable balance between precision and recall. F1 is most informative for uneven class distributions; since our target distribution is fairly even and we are prioritizing sensitivity to minimize the false negative rate, we treat overall balance as a secondary concern.
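As a sanity check, the same F1 follows directly from the k=7 confusion matrix cells:
precision <- 26 / (26 + 8) # TP / (TP + FP)
recall <- 26 / (26 + 6) # TP / (TP + FN), i.e. sensitivity
2 * precision * recall / (precision + recall) # = 52/66, roughly 0.788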
To adjust the threshold value of our model, we first plot a ROC curve (the performance of a binary classifier against different thresholds):
heart_pred <- prediction(heart_eval$pred_prob, heart_eval$target)
heart_roc <- performance(heart_pred, "tpr", "fpr")
plot(heart_roc, colorize = TRUE)
The ROC curve rises steeply toward the top-left corner of the graph, indicating that the model has significant predictive power. This is verified with the AUC score (the probability that a randomly chosen positive case is ranked above a randomly chosen negative case):
auc_value <- performance(heart_pred, 'auc')
paste("auc=", auc_value@y.values)
## [1] "auc= 0.833705357142857"
The higher the AUC score (on a 0-1 scale), the better the classifier; our relatively strong AUC indicates our model is fairly high-performing.
From the ROC curve we can identify a threshold that maximizes the true positive rate while holding the false positive rate within reasonable constraints; at a threshold value of 0.45, we see a high true positive rate while the false positive rate remains fairly controlled.
adjust_thres <- function(x, y, z) {
#x=pred_probablities, y=threshold, z=test_outcome
thres <- as.factor(ifelse(x > y, 1,0))
confusionMatrix(thres, z, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
}
adjust_thres(heart_eval$pred_prob, .45, as.factor(heart_eval$target))
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 20 6
## 1 8 26
##
## Accuracy : 0.7667
## 95% CI : (0.6396, 0.8662)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.0001655
##
## Kappa : 0.5291
##
## Mcnemar's Test P-Value : 0.7892680
##
## Sensitivity : 0.8125
## Specificity : 0.7143
## Pos Pred Value : 0.7647
## Neg Pred Value : 0.7692
## Precision : 0.7647
## Recall : 0.8125
## F1 : 0.7879
## Prevalence : 0.5333
## Detection Rate : 0.4333
## Detection Prevalence : 0.5667
## Balanced Accuracy : 0.7634
##
## 'Positive' Class : 1
##
Running our model at the modified threshold of 0.45 yields metrics identical to the previous model. This is expected: with k=7 the vote-based prediction probabilities fall at multiples of roughly 1/7, and none of them lie between 0.45 and the default cutoff of 0.5. To show the effect of a more aggressive cutoff, we can lower the threshold to a value that maximizes sensitivity with little regard for specificity (0.3):
adjust_thres <- function(x, y, z) {
#x=pred_probablities, y=threshold, z=test_outcome
thres <- as.factor(ifelse(x > y, 1,0))
confusionMatrix(thres, z, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
}
adjust_thres(heart_eval$pred_prob, .3, as.factor(heart_eval$target))
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 12 1
## 1 16 31
##
## Accuracy : 0.7167
## 95% CI : (0.5856, 0.8255)
## No Information Rate : 0.5333
## P-Value [Acc > NIR] : 0.002868
##
## Kappa : 0.4111
##
## Mcnemar's Test P-Value : 0.000685
##
## Sensitivity : 0.9688
## Specificity : 0.4286
## Pos Pred Value : 0.6596
## Neg Pred Value : 0.9231
## Precision : 0.6596
## Recall : 0.9688
## F1 : 0.7848
## Prevalence : 0.5333
## Detection Rate : 0.5167
## Detection Prevalence : 0.7833
## Balanced Accuracy : 0.6987
##
## 'Positive' Class : 1
##
This threshold raises sensitivity to 96.88%, but it substantially reduces accuracy, kappa, and specificity. We suggest that the default threshold is preferable for maintaining some balance between the true positive and true negative classification rates.
From our misclassification analysis, we see that our model is fairly volatile across its metrics. While high-performing on sensitivity, it sacrifices balance to counteract the high cost of type 2 errors. This can be partially attributed to the size of the dataset: while evenly distributed between the binary target classes, there are only 303 observations for the model to learn from. To improve the model, we suggest aggregating more data, which should yield better-calibrated prediction probabilities and allow researchers to weigh metrics beyond sensitivity. Furthermore, the multi-level categorical variables that were dropped could be one-hot encoded to provide the model with more information.
We started by cleaning the data. The general process was to drop the Date and Location columns, delete the categorical wind-direction variables, remove columns with more than 30% missing values, and discard the remaining incomplete rows. Afterwards, the data was scaled and split into training and testing sets (80-20 ratio).
set.seed(2001)
weatherAUS_data <- read.csv("weatherAUS.csv", #<- name of the data set.
check.names = FALSE, #<- don't change column names.
stringsAsFactors = FALSE)#<- don't convert the numbers and characters to factors
weatherAUS_data.cleaned <- weatherAUS_data
weatherAUS_data.cleaned <- weatherAUS_data.cleaned[,c(-1,-2)] # Get rid of Date and Location
weatherAUS_data.cleaned <- weatherAUS_data.cleaned[, which(colMeans(is.na(weatherAUS_data.cleaned)) < 0.3)] # Delete columns with more than 30% na's
weatherAUS_data.cleaned <- weatherAUS_data.cleaned[complete.cases(weatherAUS_data.cleaned), ]
weatherAUS_data.cleaned <- weatherAUS_data.cleaned %>% select(-contains("Dir")) # Get rid of Directional data
weatherAUS_data.cleaned$RainToday <- recode(weatherAUS_data.cleaned$RainToday, 'No' = 0, 'Yes' = 1)
weatherAUS_data.cleaned$RainTomorrow <- recode(weatherAUS_data.cleaned$RainTomorrow, 'No' = 0, 'Yes' = 1)
correlations <- cor(weatherAUS_data.cleaned) # View correlations in case
weatherAUS_data.cleaned[, -ncol(weatherAUS_data.cleaned)] <- lapply(weatherAUS_data.cleaned[, -ncol(weatherAUS_data.cleaned)],function(x) scale(x)) # scale the data for better analysis
# Create a shortened dataframe for caret analysis, since original dataframe is computationally expensive
weatherAUS_data.shortened <- weatherAUS_data.cleaned[1:10000,]
#caret function that will allow us to divide the data into test and train; it randomly assigns rows to each set while maintaining the relative balance (0s and 1s) of the target variable.
split_index <- createDataPartition(weatherAUS_data.cleaned$RainTomorrow, p = .8, # split 80% - 20%
list = FALSE,#output of the data, we don't want a list
times = 1)#the number of partitions to create we just want one
# Create a split for a shortened training and testing split for computation
split_index_shortened <- createDataPartition(weatherAUS_data.shortened$RainTomorrow, p = .8, # split 80% - 20%
list = FALSE,#output of the data, we don't want a list
times = 1)#the number of partitions to create we just want one
#then we just pass the index to our dataset
weatherAUS_data_train <- weatherAUS_data.cleaned[split_index,]
dim(weatherAUS_data_train) # Confirm dimensions
## [1] 90137 14
weatherAUS_data_test <- weatherAUS_data.cleaned[-split_index,]
dim(weatherAUS_data_test) # Confirm dimensions
## [1] 22534 14
# Shortened datasets
weatherAUS_data_train.shortened <- weatherAUS_data.shortened[split_index_shortened,]
weatherAUS_data_test.shortened <- weatherAUS_data.shortened[-split_index_shortened,]
One point to note is that the dataset consists of over 100,000 observations, making it computationally expensive to analyze as a whole. As a solution, a shortened 10,000-row subset was used to quickly find the optimal number of neighbors; the final model with the chosen k was then run on the full dataset, saving considerable computation time.
Using the R caret package, the optimal number of neighbors was found.
# Run a quick cross-validated tuning pass for k using caret
trctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3) # 10-fold cross-validation repeated 3 times, passed into the knn model below
weatherAUS_data_train.shortened$RainTomorrow <- as.factor(weatherAUS_data_train.shortened$RainTomorrow)
weatherAUS_knn_caret <- train(RainTomorrow~.,
data = weatherAUS_data_train.shortened,
method="knn",
tuneLength=5,
trControl= trctrl,#cv method above, will select the optimal K
preProcess="scale") # already did this but helpful reference
weatherAUS_knn_caret # take a look
## k-Nearest Neighbors
##
## 8000 samples
## 13 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 7200, 7200, 7200, 7200, 7200, 7200, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8505000 0.4765761
## 7 0.8550000 0.4851985
## 9 0.8578333 0.4903486
## 11 0.8607083 0.4991793
## 13 0.8610833 0.4967816
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
The optimal value found by the algorithm was k=13. Accuracy, the selection metric, is highest at k=13 (kappa actually peaks slightly earlier, at k=11), so this value was used for the final model.
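The selected k can also be read straight off the fitted caret object, whose bestTune element stores the winning tuning parameters:
weatherAUS_knn_caret$bestTune$k # returns 13, the k used below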
Using the optimal k value, a confusion matrix was generated.
# KNN Work
weatherAUS_13NN <- knn(train = weatherAUS_data_train[,-ncol(weatherAUS_data_train)],#<- training set cases
test = weatherAUS_data_test[,-ncol(weatherAUS_data_test)], #<- test set cases
cl = weatherAUS_data_train[,"RainTomorrow"],#<- category for true classification
k = 13,#<- number of neighbors considered
use.all = TRUE, #<- control ties: all distances equal to the kth largest are included
prob = TRUE) #<- return the proportion of votes for the winning class
length(weatherAUS_13NN)
## [1] 22534
confusionMatrix(as.factor(weatherAUS_13NN), as.factor(weatherAUS_data_test$RainTomorrow), positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 16689 2616
## 1 874 2355
##
## Accuracy : 0.8451
## 95% CI : (0.8403, 0.8498)
## No Information Rate : 0.7794
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4849
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4737
## Specificity : 0.9502
## Pos Pred Value : 0.7293
## Neg Pred Value : 0.8645
## Precision : 0.7293
## Recall : 0.4737
## F1 : 0.5744
## Prevalence : 0.2206
## Detection Rate : 0.1045
## Detection Prevalence : 0.1433
## Balanced Accuracy : 0.7120
##
## 'Positive' Class : 1
##
The confusion matrix resulted in an accuracy of 84.51% and a kappa of 0.4849, which was expected from the caret analysis. However, upon closer examination the sensitivity is only 47.37% compared with a specificity of 95.02%: the model recovers fewer than half of the rainy days, roughly what a coin flip per positive case would achieve, while it is very good at predicting days without rain. This can definitely be improved on.
As mentioned above, the accuracy is 84.51% and the kappa value is 0.4849.
To get the TPR and FPR, we used the following formulas, with cell counts taken from the confusion matrix above.
tpr <- 2355/(2355 + 2616) # TP / (TP + FN)
sprintf("TPR Calculated: %s", round(tpr, 4))
## [1] "TPR Calculated: 0.4737"
fpr <- 874/(874 + 16689) # FP / (FP + TN)
sprintf("FPR Calculated: %s", round(fpr, 4))
## [1] "FPR Calculated: 0.0498"
Next, we calculated the F1 score.
F1 <- F1_Score(y_pred = weatherAUS_13NN, y_true = weatherAUS_data_test$RainTomorrow, positive = "1")
sprintf("F1 of Model: %s", F1)
## [1] "F1 of Model: 0.574390243902439"
The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, reached when either precision or recall is zero. Our model's F1 of 0.574 sits in the middle of that range, so there is definitely room for improvement.
The log-loss value was then computed to evaluate the performance of the KNN model.
#In order to use most evaluation packages it's just easier to have our predictions and targets in one place.
prob_knn <- tibble(weatherAUS_13NN, attributes(weatherAUS_13NN)$prob)
prob_knn$prob <- if_else(prob_knn$weatherAUS_13NN == 0,
1-prob_knn$`attributes(weatherAUS_13NN)$prob`, prob_knn$`attributes(weatherAUS_13NN)$prob`) # an example of converting the probabilities to the correct format
weatherAUS_eval <- data.frame(pred_class = weatherAUS_13NN,
pred_prob = prob_knn$prob,
target = as.numeric(weatherAUS_data_test$RainTomorrow))
weatherAUS_pred <- prediction(weatherAUS_eval$pred_prob, weatherAUS_eval$target)
weatherAUS_roc <- performance(weatherAUS_pred, "tpr", "fpr")
ll <- LogLoss(as.numeric(weatherAUS_eval$pred_prob), as.numeric(weatherAUS_data_test$RainTomorrow))
sprintf("LogLoss of Model: %s", ll)
## [1] "LogLoss of Model: 0.748923984593571"
The log-loss of the model was 0.749. Lower log-loss indicates prediction probabilities that sit closer to the true labels, so this moderate value suggests the model's confidence in its predictions could still be improved.
The ROC Curve was then plotted to observe the threshold that could improve the KNN classifications.
plot(weatherAUS_roc, colorize = TRUE)
abline(a = 0, b = 1) # reference line for a no-skill classifier
Based on the curve, the model looks like a generally good fit, since the curve sits well above the diagonal no-skill line. Let's take a look at the AUC value to confirm this.
pred <- prediction(as.numeric(weatherAUS_eval$pred_prob), as.numeric(weatherAUS_data_test$RainTomorrow))
knn_perf <- performance(pred,"tpr","fpr")
KNN_perf_AUC <- performance(pred,"auc")
paste("AUC: ", KNN_perf_AUC@y.values)
## [1] "AUC: 0.843139935477045"
In general, an AUC of 0.5 suggests no discrimination (i.e., ability to diagnose patients with and without the disease or condition based on the test), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding.
For our model this is an excellent fit. Based on the ROC curve, we chose a threshold of 0.40, roughly where the curve begins to flatten, and re-evaluated the classifier at that cutoff.
adjust_thres <- function(x, y, z) {
#x=pred_probablities, y=threshold, z=test_outcome
thres <- as.factor(ifelse(x > y, 1,0))
confusionMatrix(thres, z, positive = "1", dnn=c("Prediction", "Actual"), mode = "everything")
}
adjust_thres(weatherAUS_eval$pred_prob, .4, as.factor(weatherAUS_data_test$RainTomorrow))
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 16238 2239
## 1 1325 2732
##
## Accuracy : 0.8418
## 95% CI : (0.837, 0.8466)
## No Information Rate : 0.7794
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5076
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5496
## Specificity : 0.9246
## Pos Pred Value : 0.6734
## Neg Pred Value : 0.8788
## Precision : 0.6734
## Recall : 0.5496
## F1 : 0.6052
## Prevalence : 0.2206
## Detection Rate : 0.1212
## Detection Prevalence : 0.1800
## Balanced Accuracy : 0.7371
##
## 'Positive' Class : 1
##
The confusion matrix at the 0.40 threshold resulted in an accuracy of 84.18% and a kappa of 0.5076: accuracy dipped slightly relative to the default threshold while kappa improved.
In fact, the sensitivity rose from 47.37% to 54.96%, while the specificity decreased only modestly, from 95.02% to 92.46%, so the lower threshold clearly improved the KNN classification for our purposes. Another plus is that the F1 score increased from 0.5744 to 0.6052, confirming that the threshold adjustment improved the model's balance of precision and recall. This improves the overall true positive rate for classifying that rain falls on the following day in Australia. While there is still a trade-off between the true positive and true negative rates, given the accuracy of the model it is safe to say that the classification generally works.
To help with classifying weather data in Australia in the future, the categorical variables that were removed, such as location, date, and wind direction, could possibly be one-hot encoded to improve prediction accuracy. The cost would be a larger data structure, but it would likely help raise the true positive rate of the model. A threshold adjustment at the end would still be necessary to evaluate and improve the classification, but as of now we have a working model that achieves an acceptable balance between the true positive and true negative rates.
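As a rough sketch of that suggestion (not run as part of this analysis; WindGustDir, WindDir9am, and WindDir3pm are the direction columns in the raw weatherAUS.csv), the wind-direction variables could be one-hot encoded with caret's dummyVars():
dir_cols <- weatherAUS_data %>%
select(WindGustDir, WindDir9am, WindDir3pm) %>%
mutate(across(everything(), as.factor)) # dummyVars expects factors
dir_dummies <- dummyVars(~ ., data = dir_cols) # build the one-hot encoder
dir_encoded <- predict(dir_dummies, newdata = dir_cols) # one indicator column per direction level, per variable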
Our first dataset detailed metrics on heart disease. This dataset contains 14 of the original 76 attributes. target is the target value to predict, where 1 = a higher risk of heart failure, 0 = a lower risk of heart failure.
Data Source: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Kaggle Link: https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
Our second dataset detailed weather findings in Australia. This dataset contains about 10 years of daily weather observations from many locations across Australia. RainTomorrow is the target variable to predict. It means – did it rain the next day, Yes or No? This column is Yes if the rain for that day was 1mm or more.
Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data.
An example of latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml
Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data.
Copyright Commonwealth of Australia 2010, Bureau of Meteorology.