In this lesson students will learn …
Motivation/Background:
Sources:
pima<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/diabetes.csv",
header=TRUE)
str(pima)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Our goal in this step is to find feature variables that best separates the two groups (0 = diabetes negative; 1 = diabetes positive).
We might want to use side-by-side boxplot or overlapping density plots.
library(tidyverse)
## BLOOD PRESSURE vs GLOCOSE
ggplot(data=pima, aes(x=Glucose, y=BloodPressure, color=factor(Outcome)))+
geom_point()
## GLUCOSE BOXPLOT
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
geom_boxplot()
## GLUCOSE DENSITY
ggplot(data=pima, aes(x=Glucose ,fill=factor(Outcome)))+
geom_density(alpha=.5)
We want to explore relationships between whether or not an individual
has diabetes, which is contained in the Outcome
variable,
and other feature variables.
We’ve made a lot of scatterplot thus far. How about we try that?
ggplot(data=pima, aes(x=Glucose, y=Outcome))+
geom_point()+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
That didn’t work out so well.
In the previous problem we encountered issues because we were
treating Outcome
as a number when it should have been
treated as a categorical variable (or factor).
KNN is a popular, non-parametric, “lazy” machine learning approach. Predictions are built off defining a neighborhood of \(k\) close neighbors, where closeness is defined by a pre-specified distance metric. Thus, \(k\) is considered a hyperparameter than can be tuned to choose the appropriate model flexibility, whilst considering the bias and variance trade-off.
Look at the summary of our data
summary(pima)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Calculating distance is central to this algorithm. Therefore, if features have values with large difference in ranges or units, this can highly influence the model. Therefore, we should pre-process the data first.
Consider the following methods:
\[X_{sc}=\frac{X-X_{min}}{X_{max}-X_{min}}\]
\[Z=\frac{X-\bar{X}}{SD(X)}\]
##### scale
#Normalization
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) }
## REMOVE NAs FIRST
pima<-na.omit(pima)
pimaNorm <- as.data.frame(lapply(pima[,1:8], normalize))%>%
cbind(Outcome=pima$Outcome)
head(pimaNorm)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.35294118 0.7437186 0.5901639 0.3535354 0.0000000 0.5007452
## 2 0.05882353 0.4271357 0.5409836 0.2929293 0.0000000 0.3964232
## 3 0.47058824 0.9195980 0.5245902 0.0000000 0.0000000 0.3472429
## 4 0.05882353 0.4472362 0.5409836 0.2323232 0.1111111 0.4187779
## 5 0.00000000 0.6884422 0.3278689 0.3535354 0.1985816 0.6423249
## 6 0.29411765 0.5829146 0.6065574 0.0000000 0.0000000 0.3815201
## DiabetesPedigreeFunction Age Outcome
## 1 0.23441503 0.4833333 1
## 2 0.11656704 0.1666667 0
## 3 0.25362938 0.1833333 1
## 4 0.03800171 0.0000000 0
## 5 0.94363792 0.2000000 1
## 6 0.05251921 0.1500000 0
Let’s make a 70-30 split for training and testing out model.
768*.7
## [1] 537.6
set.seed(999)
sampleInd<-sample(1:768, 537)
train<-pimaNorm[sampleInd,]
test<-pimaNorm[-sampleInd,]
## What proportion people in the training and testing sets have diabetes
mean(train$Outcome)
## [1] 0.3798883
mean(test$Outcome)
## [1] 0.2770563
Wow! There is a 10% difference! We want to have the same proportion of diabetes in the training set as the testing set because it is important that our training data is representative.
Let’s look at that
#### VISUALIZE THIS
pima2<-pimaNorm%>%
mutate(train=FALSE)
pima2$train[sampleInd]<-TRUE
ggplot(data=pima2, aes(x=Glucose, y=BloodPressure, color=factor(Outcome)))+
geom_point()+
facet_grid(.~train)
First, we will build it from scratch by using basic tools in
dplyr
.
### INSTEAD we want the same prop
## PARTITION DATA
pima0<-pimaNorm%>%
filter(Outcome==0)
dim(pima0)
## [1] 500 9
pima1<-pimaNorm%>%
filter(Outcome==1)
dim(pima1)
## [1] 268 9
## SAMPLE INDECES
sample0<-sample(1:500, 350)
sample1<-sample(1:268, 188)
## TRAINING AND TESTING SETS
trainStrat<-rbind(pima0[sample0, ],
pima1[sample1, ])
testStrat<-rbind(pima0[-sample0, ],
pima1[-sample1, ])
## PROPORITON OF OUTCOME
mean(trainStrat$Outcome)
## [1] 0.3494424
mean(testStrat$Outcome)
## [1] 0.3478261
library(caret)
# Split the data into training and test set
set.seed(314)
caretSamp <- createDataPartition(pimaNorm$Outcome ,
p = 0.7,
list = FALSE)
## SPLIT TESTING AND TRAINING
trainCaret <- pimaNorm[caretSamp, ]
testCaret <- pimaNorm[-caretSamp, ]
## PROPORITON OF OUTCOME
mean(trainCaret$Outcome)
## [1] 0.3494424
mean(testCaret$Outcome)
## [1] 0.3478261
The knn
function is within the class
package. Take a moment to read the documentation.
library(class)
help(knn)
We will need to specify four arguments:
train
: matrix or data frame of training set
cases.test
: matrix or data frame of test set cases. A vector
will be interpreted as a row vector for a single case.cl
: factor of true classifications of training
setk
: number of neighbours considered.Let’s specify the arguments
#### look at knn function
## need train, test, cl, and k
trainFea<-trainCaret%>%
select(-Outcome)
dim(trainFea)
## [1] 538 8
testFea<-testCaret%>%
select(-Outcome)
trainOut<-trainCaret$Outcome
testOut<-testCaret$Outcome
Now to implement the algorithm. If there is a “tie” in closeness to the datapoint in question, that tie will be split randomly.
set.seed(1)
knn.pred=knn(train = trainFea,
test = testFea,
cl = trainOut,
k=1)
head(knn.pred)
## [1] 0 1 0 0 0 1
## Levels: 0 1
We assess model fit for classification models with the confusion matrix. This illustrates the concordance (diagonal) and discordance (off-diagonal) between our predictions and test data.
#Confusion matrix
cm<-table(knn.pred,testOut)
cm
## testOut
## knn.pred 0 1
## 0 115 39
## 1 35 41
Thus the correct prediction rate can be calculated using concordance:
### correct rate
mean(knn.pred==testOut)
## [1] 0.6782609
And the error is the complement
### error rate
1-mean(knn.pred==testOut)
## [1] 0.3217391
## OR
mean(knn.pred!=testOut)
## [1] 0.3217391
Other popular metrics include sensitivity and specificity: * Sensitivity = True Positive / (True Positive + False Negative) * Specificity = True Negative / (False Positive + True Negative)
Low values of \(k\) result is too wiggly (overfitting) models that are influenced by noise (high variance). On the other hand, high values of \(k\) are response enough (underfitting) and result in biased models. We can use a grid search to find the best value of \(k\).
## Whats the best k
## Pick a neighborhood
set.seed(123)
error <- c()
for (i in 1:30){
knnPima<- knn(train = trainFea,
test = testFea,
cl = trainOut,
k = i)
error[i] = 1- mean(knnPima==testOut)
}
ggplot(data = data.frame(error), aes(x = 1:30, y = error)) +
geom_line(color = "Blue")+
xlab("Neighborhood Size")
which.min(error)
## [1] 12
## best pred...
knn.pred12=knn(train = trainFea,
test = testFea,
cl = trainOut,
k=12)
#Confusion matrix
cm12<-table(knn.pred12,testOut)
mean(knn.pred12==testOut)
## [1] 0.726087
Caret is a useful tool that standardizes model fitting syntax; however, its a blackbox.
### THE CARET METHOD CARET
set.seed(314)
model <- train(
factor(Outcome) ~.,
data = trainCaret ,
method = "knn",
trControl = trainControl("cv", number = 10),
preProcess = c("range"),
tuneLength = 20
)
# Plot model accuracy vs different values of k
plot(model)
# Print the best tuning parameter k that
# maximizes model accuracy
model$bestTune
## k
## 4 11
predicted.classes <- model %>% predict(testCaret)
head(predicted.classes)
## [1] 0 1 0 0 1 0
## Levels: 0 1
# Compute model accuracy rate
cmCaret<-table(predicted.classes ,testOut)
cmCaret
## testOut
## predicted.classes 0 1
## 0 120 37
## 1 30 43
### THE CARET WAY!
confusionMatrix(cmCaret)
## Confusion Matrix and Statistics
##
## testOut
## predicted.classes 0 1
## 0 120 37
## 1 30 43
##
## Accuracy : 0.7087
## 95% CI : (0.6454, 0.7666)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.04036
##
## Kappa : 0.3445
##
## Mcnemar's Test P-Value : 0.46355
##
## Sensitivity : 0.8000
## Specificity : 0.5375
## Pos Pred Value : 0.7643
## Neg Pred Value : 0.5890
## Prevalence : 0.6522
## Detection Rate : 0.5217
## Detection Prevalence : 0.6826
## Balanced Accuracy : 0.6687
##
## 'Positive' Class : 0
##
The iris
dataset is a popular dataset used for
statistical examples. The documentation for this dataset in R
states:
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Our goal is to be able to predict the species of an unknown iris flower, given measurements for four physical characteristics.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Since we are using a distance metric to assess closeness of neighbors, it is essential to standardize our data.
# unstandardized
var(iris[ ,1])
## [1] 0.6856935
var(iris[ ,2])
## [1] 0.1899794
# use scale to standardize
irisS<-iris
irisS[,1:4] <-scale(irisS[,1:4])
irisS<-data.frame(irisS)
# now check the standardized variance
var(irisS[ ,1])
## [1] 1
var(irisS[ ,2])
## [1] 1
Looking at the data do there appear to be any clear groupings?
### Are there groups?
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(irisS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## explore
library(tidyverse)
ggplot(data = irisS, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
ggplot(data = irisS, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
geom_point()
We want to make sure that we have balanced representation in the training and testing sets.
### train and test
set.seed(239)
setosa<- irisS%>%
filter(Species=="setosa")
versicolor<- irisS%>%
filter(Species=="versicolor")
virginica<- irisS%>%
filter(Species=="virginica")
# 50 observations from each
# Take 60% to train (30) and 40% to test (20)
train <- sample(1:50, 30)
iris.train<- rbind(setosa[train,], versicolor[train,], virginica[train,])
iris.test<- rbind(setosa[-train,], versicolor[-train,], virginica[-train,])
library(class)
knnIris<- knn(train = iris.train[,1:4],
test = iris.test[,1:4],
cl = iris.train$Species,
k = 3)
## correct
mean(knnIris==iris.test$Species)
## [1] 0.9833333
## error
mean(knnIris!=iris.test$Species)
## [1] 0.01666667
## Pick a neighborhood
error <- c()
for (i in 1:15)
{
knnIris<- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k = i)
error[i] = 1- mean(knnIris == iris.test$Species)
}
ggplot(data = data.frame(error), aes(x = 1:15, y = error)) +
geom_line(color = "Blue")+
xlab("Neighborhood Size")
iris_pred <- knn(train = iris.train[,1:4],
test = iris.test[,1:4],
cl = iris.train$Species,
k=3)
table(iris.test$Species, iris_pred)
## iris_pred
## setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 20 0
## virginica 0 1 19