---
title: "Week 4 Pima Indians Diabetes"
output: html_notebook
author: John Neville
---

Prompt:

Using the Pima Indians diabetes data set, partition the data into a training set and a test set with an 80:20 ratio.

+ Get to know your data; start out with data exploration. Summarize your findings.
+ Use the bagging method (e.g. the adabag package or ipred package) to train on the training data set. Then, predict with the model on the test set. Display the confusion matrix, its accuracy, and the average error from the predicted results.
+ Perform 10-fold cross-validation using bagging and report the performance. Compare the result with 2).
+ Use the boosting method (e.g. the adabag package or caret package) to train on the training data set. Then, predict with the model on the test set. Display the confusion matrix, its accuracy, and the average error from the predicted results.
+ Perform 10-fold cross-validation with the boosting method and report the performance. Compare the result with 4).
+ Compare all the results. Conclude and discuss your findings.
+ Compare these two techniques. What is your conclusion? Discuss (include any tables/graphs that correspond to your reasoning).
+ From the analysis results, what are your recommended actions?
+ Please submit your report and source code (.r) in the drop box by Apr. 9.

We begin this exercise by reading the data into our project from the UCI Machine Learning Repository. We also make sure to add appropriate names to the data frame.

```{r}
library(ggplot2)
library(dplyr)

# Pull the raw file from the UCI Machine Learning Repository
file <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

raw_data <- read.csv(file, header = FALSE)

# The file ships without a header row, so assign descriptive column names
col_names <- c("Pregnant", "GlucoseCon", "Diastolic", "SkinThick",
               "Insulin", "BMI", "Diab", "Age", "Class")
names(raw_data) <- col_names
```

The first thing the prompt has us do is split the data into an 80:20 train/test partition, which we execute below:
 

```{r}
set.seed(25)  # For reproducibility
smpl <- floor(0.80 * nrow(raw_data)) # Specify ratio of sample split
train_ind <- sample(seq_len(nrow(raw_data)), size = smpl) # Making the index
train <- raw_data[train_ind, ] # Assign training data
test <- raw_data[-train_ind, ] # Assign testing data

```
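
As a quick sanity check (a small added chunk, not part of the original submission), the split should leave 614 rows for training and 154 for testing out of the 768 in the raw file:

```{r}
# floor(0.80 * 768) = 614 training rows; the remaining 154 are held out
nrow(train)
nrow(test)
```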

Now that we have our working data set, we can begin to explore some of its properties with basic charts and summaries.

```{r}
summary(train)
str(train)
plot(train)

# Count diabetics vs non-diabetics at each age
ages <- train %>%
  group_by(Age = as.factor(Age), Class = as.factor(Class)) %>%
  summarise(NumberOfPeople = n())
ggplot(ages, aes(Age, NumberOfPeople)) +
  geom_bar(aes(fill = Class), position = "dodge", stat = "identity") +
  scale_x_discrete(breaks = seq(21, 81, 2)) +
  labs(title = "Number of Diabetics vs Non-Diabetics by Age")

# Count diabetics vs non-diabetics by number of pregnancies
pregs <- train %>%
  group_by(TimesPregnant = as.factor(Pregnant), Class = as.factor(Class)) %>%
  summarise(NumberOfPeople = n())
ggplot(pregs, aes(TimesPregnant, NumberOfPeople)) +
  geom_bar(aes(fill = Class), position = "dodge", stat = "identity") +
  scale_x_discrete(breaks = seq(0, 17, 2)) +
  labs(title = "Number of Diabetics vs Non-Diabetics by Number of Times Pregnant")

# Mean insulin level within each class
Ins <- train %>%
  group_by(Class = as.factor(Class)) %>%
  summarise(AvgInsulinLevel = mean(Insulin))
ggplot(Ins, aes(Class, AvgInsulinLevel)) +
  geom_bar(stat = "identity") +
  labs(title = "Avg Insulin Levels by Class")

```


It looks like there may be a relationship between a few of the variables we looked at, but nothing overwhelming stands out. In the scatterplot matrix, there are several large clusters that are hard to distinguish, but there could be something there. Let's now begin with the "adabag" package, as recommended in the prompt, to try our hand at classifying this data.
```{r}
library(adabag)

# adabag expects a factor response for classification
train$Class <- as.factor(train$Class)
baggingTrained <- bagging(Class ~ ., data = train, mfinal = 10, boos = FALSE)

# Confusion matrix of the model's in-sample (training) predictions
table(baggingTrained$class, train$Class, dnn = c("Predicted Class", "Observed Class"))

```
This did not perform very well even on its own training data. Something to point out at this stage is that there are far more class-0 records than class-1 records, and this model appears to have voted heavily for class 0 on records that are actually class 1. The next step is to apply this model to the test set from before and get a few metrics to see how it does. We will evaluate this model based on the confusion matrix, accuracy, and average error.
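
To quantify the imbalance just mentioned, here is a quick added check of the class proportions in the training split:

```{r}
# Roughly two-thirds of the training records are non-diabetic (class 0)
prop.table(table(train$Class))
```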




```{r}
baggingTest <- predict.bagging(baggingTrained, newdata = test)
baggingTest$confusion

baggingTest$error

# Accuracy = 1 - (misclassified records / total test records)
accuracy <- 1 - (14 + 27) / 154

accuracy

```
This package does not have a great way to report these metrics, so I will do a few calculations manually; a small helper sketch follows the bullets below.

Specificity: When there is no diabetes, how often do we predict no diabetes?

+ 87/101 = 86%

Precision: When we did predict diabetes, how often is it correct?

+ 26/40 = 65%
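
For convenience, here is a minimal helper sketch (an addition, assuming the 0/1 labels used above and adabag's confusion layout, with predicted classes in rows and observed classes in columns) that derives these metrics at once:

```{r}
# Derive accuracy, sensitivity, specificity, and precision from an
# adabag-style confusion matrix (rows = predicted, columns = observed)
class_metrics <- function(conf) {
  tn <- conf["0", "0"]; fp <- conf["1", "0"]
  fn <- conf["0", "1"]; tp <- conf["1", "1"]
  c(accuracy    = (tp + tn) / sum(conf),
    sensitivity = tp / (tp + fn),  # observed positives predicted positive
    specificity = tn / (tn + fp),  # observed negatives predicted negative
    precision   = tp / (tp + fp))  # predicted positives that are correct
}

class_metrics(baggingTest$confusion)
```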

That precision number is particularly weak.
The prompt would have us do 10-fold cross-validation with our bagging model. Let's see if that makes a difference for our precision.

```{r}

# Note: bagging() takes no v argument, so v = 10 appears to be silently
# ignored here; adabag's bagging.cv(), sketched below, does true k-fold CV
baggingTrained10Fold <- bagging(Class ~ ., data = train, v = 10, mfinal = 10, boos = FALSE)

table(baggingTrained10Fold$class, train$Class, dnn = c("Predicted Class", "Observed Class"))
baggingTest10Fold <- predict.bagging(baggingTrained10Fold, newdata = test)
baggingTest10Fold$confusion

baggingTest10Fold$error

accuracy <- 1 - (12 + 27) / 154

accuracy
```
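
One caveat worth flagging: because `bagging()` ignores the `v` argument, the chunk above still fits an ordinary bagging model. adabag exposes k-fold cross-validation through a separate function, `bagging.cv()`; a minimal sketch of what a true 10-fold estimate on the training data could look like (it returns CV predictions and an error rate, not a reusable fitted model):

```{r}
# Sketch only: 10-fold cross-validated bagging error on the training set.
# bagging.cv() returns $class, $confusion, and $error, but no model object,
# so its result cannot be passed to predict.bagging().
baggingCV <- bagging.cv(Class ~ ., data = train, v = 10, mfinal = 10)
baggingCV$confusion
baggingCV$error
```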


Specificity: When there is no diabetes, how often do we predict no diabetes?

+ 89/101 = 88%

Precision: When we did predict diabetes, how often is it correct?

+ 26/38 = 68%

This performed slightly better, with two fewer misclassifications, but it still does not perform well.
Let's move on to boosting, continuing with the adabag package for consistency.

```{r}
boostTrain <- boosting(Class ~ ., data = train, mfinal = 10)

table(boostTrain$class, train$Class, dnn = c("Predicted Class", "Observed Class"))
boostTest <- predict.boosting(boostTrain, newdata = test)
boostTest$confusion
boostTest$error
accuracy <- 1 - (18 + 23) / 154
accuracy
```

Our boosting model appears to have about the same test accuracy as our previous bagging models.
One thing to point out: when comparing predicted and observed classes on the training data itself, boosting's confusion matrix reflects a much higher training accuracy than our bagging models achieved.

Specificity: When there is no diabetes, how often do we predict no diabetes?

+ 83/101 = 82%

Precision: When we did predict diabetes, how often is it correct?

+ 30/48 = 63%

One takeaway from these metrics is that this prediction made more guesses for 1, a positive diabetes classification, though overall it was less accurate in doing so. It could be the case that our model is overfitted to our training set. As recommended in the prompt, we can try using 10-fold cross-validation to account for this problem. Let's see if it works for our diabetes classifications.

```{r}
# As with bagging above, boosting() takes no v argument; see the
# boosting.cv() sketch below for a true 10-fold estimate
boostTrain10Fold <- boosting(Class ~ ., data = train, v = 10, mfinal = 10)

table(boostTrain10Fold$class, train$Class, dnn = c("Predicted Class", "Observed Class"))
boostTest10Fold <- predict.boosting(boostTrain10Fold, newdata = test)
boostTest10Fold$confusion
boostTest10Fold$error
accuracy <- 1 - (20 + 24) / 154
accuracy
```
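
The same caveat applies here: `boosting()` does not take a `v` argument, and `boosting.cv()` is adabag's cross-validated counterpart. A minimal sketch:

```{r}
# Sketch only: 10-fold cross-validated boosting error on the training set
boostCV <- boosting.cv(Class ~ ., data = train, v = 10, mfinal = 10)
boostCV$confusion
boostCV$error
```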


Specificity: When there is no diabetes, how often do we predict no diabetes?

+ 81/101 = 80%

Precision: When we did predict diabetes, how often is it correct?

+ 29/49 = 59%

Interestingly, our first boosting model had higher precision and specificity. That just goes to show why 10-fold validation is important: our first model performed better than we should expect.

Let's compare the variable importance across all four models, to see what each one relies on, before evaluating them on performance.

```{r}
baggingTrained$importance
baggingTrained10Fold$importance
boostTrain$importance
boostTrain10Fold$importance

barplot(sort(baggingTrained$importance, decreasing = TRUE), ylim = c(0, 100),
        main = "Bagging Relative Importance", col = "lightblue")

barplot(sort(baggingTrained10Fold$importance, decreasing = TRUE), ylim = c(0, 100),
        main = "Bagging 10-Fold Cross Validated Relative Importance", col = "lightblue")

barplot(sort(boostTrain$importance, decreasing = TRUE), ylim = c(0, 100),
        main = "Boosting Relative Importance", col = "lightblue")

barplot(sort(boostTrain10Fold$importance, decreasing = TRUE), ylim = c(0, 100),
        main = "Boosting 10-Fold Cross Validated Relative Importance", col = "lightblue")

```
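
To put the four importance vectors side by side, one option (a sketch, reusing the dodged-bar approach from the exploration section) is to stack them into a single long data frame:

```{r}
# Stack each model's importance vector into one data frame for plotting
imp_list <- list(`Bagging`          = baggingTrained$importance,
                 `Bagging 10-Fold`  = baggingTrained10Fold$importance,
                 `Boosting`         = boostTrain$importance,
                 `Boosting 10-Fold` = boostTrain10Fold$importance)
imp <- do.call(rbind, lapply(names(imp_list), function(m)
  data.frame(Model = m, Variable = names(imp_list[[m]]),
             Importance = imp_list[[m]])))

ggplot(imp, aes(Variable, Importance, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Relative Variable Importance Across Models")
```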

You can see that these algorithms vary in how much they value certain variables. In this case, boosting spreads its weight more evenly across the variables, whereas bagging rates GlucoseCon as significantly more important than the rest.
Accuracy in review:

+ Bagging:  73.4%
+ Bagging w Validation:  74.7%
+ Boosting:  73.4%
+ Boosting w Validation:  71.4%

The highest performing was bagging with validation, but we can also evaluate based on precision.

+ Bagging:  65%
+ Bagging w Validation:  68%
+ Boosting:  63%
+ Boosting w Validation:  59%

If we were to conclude our findings here, the best model to use would once again be bagging with validation. If we were to continue looking at new models, the first thing I would do is use a different package, since the adabag package is very limited in its capabilities.
In terms of modelling, the next step would be to see whether we can better predict diabetes by removing some of the variables that were not found to be important. Another technique we could use is to redraw the initial sample to make sure it is representative of the population.
If we were to go more in depth, I would like to evaluate these and further models with ROC curves to better compare their discriminative performance; a minimal sketch of that follows.
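
As a starting point for that follow-up, here is a sketch using the pROC package (an assumed extra dependency; it is not loaded anywhere above) against the class-1 probabilities stored by `predict.bagging()` and `predict.boosting()`:

```{r}
library(pROC)  # assumed to be installed; not used elsewhere in this notebook

# Both predict.bagging() and predict.boosting() return a $prob matrix;
# its second column is the estimated probability of class 1 (diabetes)
bagROC   <- roc(test$Class, baggingTest$prob[, 2])
boostROC <- roc(test$Class, boostTest$prob[, 2])

plot(bagROC, main = "ROC: Bagging (black) vs Boosting (red)")
lines(boostROC, col = "red")
auc(bagROC)
auc(boostROC)
```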


