Introduction to Support Vector Machines


Support Vector classification attempts to construct a hyperplane that correctly separates the classes. In two dimensions the hyperplane is a line, defined by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

and in p dimensions by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0\]

If the left-hand side of the equation is greater than 0 then X lies on one side of the hyperplane, and if it is less than 0 then X lies on the other side.
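As a quick illustration (with made-up coefficients, not from any fitted model), we can evaluate the left-hand side for a point and check its sign:

beta = c(-1, 2, 3)               # beta0, beta1, beta2 (illustrative values)
x.new = c(1, 1)                  # a point (X1, X2)
beta[1] + sum(beta[-1] * x.new)  # -1 + 2*1 + 3*1
[1] 4

Since 4 > 0, the point lies on the positive side of the hyperplane.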

For linear classification the model finds the maximal margin hyperplane. The margin is the smallest distance from the training observations to the separating hyperplane, and the model maximizes this distance. This can be visualized in the graphic below.

The two classes are shown in red and green. The maximal margin hyperplane is shown as a solid line, and the margin is the distance from the solid line to either of the dashed lines. The points that lie on the dashed lines are the support vectors. The red and green grid represents the decision rule made by the classifier.

We can see that there is one green point on the red side (misclassified). The model allowed this because I used a tuning parameter, C, which sets a budget for how many points may violate the margin. As C increases the margin widens and we become more tolerant of violations. If C is large then the classifier will have low variance and high bias, but if C is small then the classifier will have high variance and low bias.
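As a minimal sketch of this trade-off, we can refit the model at several cost values and count the support vectors (using the same simulated toy data as the next code block). Note that e1071's cost argument is the penalty on margin violations, so it runs in the opposite direction of the budget C described above: a small cost tolerates many violations and yields a wide margin with many support vectors.

library(e1071)
set.seed(1)
x = matrix(rnorm(20*2), ncol = 2)
y = c(rep(-1,10), rep(1,10))
x[y==1,] = x[y==1,] + 1
toy = data.frame(x = x, y = as.factor(y))
for (cost.val in c(0.01, 1, 100)) {
  fit = svm(y~., data = toy, kernel = "linear", cost = cost.val, scale = FALSE)
  cat("cost =", cost.val, "-> support vectors:", length(fit$index), "\n")
}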

R Code for Sample Hyperplane


library(e1071)    # svm() and tune()
library(caret)    # train() wrapper used for Naive Bayes
library(klaR)     # NaiveBayes model called by caret
library(bestglm)  # best-subset selection for logistic regression

set.seed(1)
x=matrix(rnorm(20*2),ncol = 2)
y=c(rep(-1,10),rep(1,10))
x[y==1,]=x[y==1,]+1
dat=data.frame(x=x,y=as.factor(y))
svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)
  1. I generated 40 random values from a normal distribution (2 columns of 20)
  2. The y variable was added as 10 values of -1 and 10 values of 1
  3. Where y=1 the x values were shifted up by 1
  4. The svm function from the "e1071" package was used to fit the model
make.grid = function(x, n=75) {
  grange = apply(x, 2, range)                        # min and max of each column
  x1 = seq(from = grange[1,1], to = grange[2,1], length = n)
  x2 = seq(from = grange[1,2], to = grange[2,2], length = n)
  expand.grid(X1 = x1, X2 = x2)                      # every (x1, x2) combination
}

This function requires a matrix as the x input. It takes the range of each column and generates a sequence of 75 values from the low to the high. Finally, expand.grid generates every combination of the x1 and x2 sequence values.
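For a concrete feel for expand.grid, here is a miniature example (values chosen arbitrarily):

expand.grid(X1 = 1:2, X2 = c(0, 5))
  X1 X2
1  1  0
2  2  0
3  1  5
4  2  5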

xgrid = make.grid(x)
colnames(xgrid) <- c("x.1","x.2")
ygrid = predict(svmfit, xgrid)
plot(xgrid, col=c("red","green")[as.numeric(ygrid)],pch = ".", cex=1, main="Sample Hyperplane")
palette(c("red","green","green"))
points(x,col=(2+y),pch=19)  

I made a grid with the x matrix and used the predict function to generate a prediction for each point. I then plotted the grid with the x values as points.

beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

Unfortunately the svm object does not expose the beta coefficients directly, but the formulas above recover them. We can then use the hyperplane equation

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

to solve for the slope and intercept of the decision boundary (plotted with abline).
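Solving that equation for \(X_2\) gives the boundary line

\[X_2 = -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}X_1\]

In the code the slope is -beta[1]/beta[2] as expected; the intercept appears as +beta0/beta[2] because e1071 stores rho as the negative intercept (its decision values are w·x - rho). The weights themselves come from the support-vector expansion

\[\mathbf{w} = \sum_{i \in S}\alpha_i y_i \mathbf{x}_i\]

where svmfit$coefs holds the products \(\alpha_i y_i\) and svmfit$index identifies the support-vector rows, which is exactly what the matrix multiplication above computes.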

Part 1: Pitney Bowes (PBI) Data

SVM to categorize high risk or low risk days for Pitney Bowes (PBI)


PBI <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/PBI.csv")
library(knitr)
library(kableExtra)
kable(PBI[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Date Open High Low Close Adj.Close Volume PBIret
1/3/2007 46.40 46.49 45.87 46.37 23.43 964700 NA
1/4/2007 46.35 46.52 45.89 46.32 23.40 1068400 -0.11
1/5/2007 46.34 46.49 46.19 46.40 23.44 728900 0.17
1/8/2007 46.35 47.06 46.25 46.86 23.67 1054000 0.99
1/9/2007 47.15 47.55 47.06 47.39 23.94 1176800 1.13
1/10/2007 47.24 47.95 46.91 47.70 24.10 1798700 0.65

I began by importing some basic stock data for PBI, from 1/3/07 to 2/5/19.

#Create Range and Risk columns
PBI$Range = PBI$High - PBI$Low
a = median(PBI$Range)
PBI$Risk = ifelse(PBI$Range<=a,"LowRisk","HighRisk")

#Shift Range down 1 to create Lag1
PBI$Lag1 = c(0,PBI[-nrow(PBI),9])
#Shift Lag1 down 1 to create Lag2
PBI$Lag2 = c(0,PBI[-nrow(PBI),11])

#Create Logs
PBI$Log1 = log(PBI$Lag1)
PBI$Log2 = log(PBI$Lag2)

#Separate the Log1, Log2, and Risk columns and delete the top 2 rows
PBIa = as.data.frame(PBI[3:nrow(PBI),c(13,14,10)])
kable(PBIa[1:6,],digits=2)%>% kable_styling(bootstrap_options = "responsive")
Log1 Log2 Risk
3 -0.46 -0.48 LowRisk
4 -1.20 -0.46 HighRisk
5 -0.21 -1.20 HighRisk
6 -0.71 -0.21 HighRisk
7 0.04 -0.71 HighRisk
8 -0.80 0.04 LowRisk

If the range of intra-day prices was at or below the median of all days, the day was labeled LowRisk, and if it was above it was labeled HighRisk. I then created a 1-day and a 2-day lag of the range data and took the log of each.
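The shift itself just drops the last element and prepends a 0; a minimal illustration with made-up values:

r = c(0.62, 0.30, 0.87, 0.45)
c(0, r[-length(r)])
[1] 0.00 0.62 0.30 0.87

Each day now carries the previous day's value.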

y=as.factor(PBIa$Risk)
kable(contrasts(y),col.names = NULL)%>% kable_styling(full_width=F, position = "left")
HighRisk 0
LowRisk 1

As a factor, HighRisk is coded as 0 and LowRisk as 1.

palette(c("red","green"))
plot(PBIa$Log1,PBIa$Log2,col=y,main="Plot of Log Range",
     ylab="Log of Lag2",xlab="Log of Lag1")

We can see that LowRisk days are clustered on the lower left and HighRisk on the upper right.

PBIb = data.frame(PBIa[,1:2],y)
set.seed(1234)
train = sample(3042,2000)

I created a new data frame with y as a factor (as required by the svm function) and took a random sample of 2,000 of the 3,042 rows for my training data.

svmfit=svm(y~.,data=PBIb[train,],kernel="linear",cost=10,scale=FALSE)
summary(svmfit)

Call:
svm(formula = y ~ ., data = PBIb[train, ], kernel = "linear", 
    cost = 10, scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 
      gamma:  0.5 

Number of Support Vectors:  1239

 ( 620 619 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I used the svm function with the training data, a linear kernel, and a cost of 10 (the tuning parameter), and I did not scale the data. We can see from the summary that there are 1239 support vectors, with 620 from one class and 619 from the other.

tune.out=tune(svm,y~.,data=PBIb[train,],kernel="linear",
              ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
 0.01

- best performance: 0.2705 

- Detailed performance results:
   cost  error dispersion
1 1e-03 0.2720 0.02898275
2 1e-02 0.2705 0.03086260
3 1e-01 0.2720 0.03409138
4 1e+00 0.2720 0.03301515
5 5e+00 0.2725 0.03285067
6 1e+01 0.2725 0.03285067
7 1e+02 0.2720 0.03301515
8 1e+03 0.2960 0.07712904

I then used the tune function, which runs 10-fold cross-validation over the supplied range of cost values, to find the best model. A cost of 0.01 produced the lowest cross-validated error.
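The tuned object exposes both the winning parameters and a model refit on the full training data, which the code below relies on:

tune.out$best.parameters   # one-row data frame holding the chosen cost
tune.out$best.model        # svm refit at that cost
tune.out$best.performance  # its cross-validated error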

svmfit=svm(y~.,data = PBIb[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
x=as.matrix(PBIb[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("Log1","Log2")
ygrid = predict(svmfit, xgrid)
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
     main="SVM classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
ColorY=PBIb[train,3]
palette(c("red","green"))
points(PBIb[train,1:2],col=ColorY,pch=19)
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

I ran the best model (cost = 0.01) and plotted the results. While the hyperplane does not perfectly split the data, which would be impossible here, it does seem to split it at a sensible location. Red represents HighRisk and green represents LowRisk.

table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      370     155
  LowRisk       130     387
confusion.matrix=table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2735125

We can now use the predict function with the best model to predict HighRisk and LowRisk days. Our test data is the 1,042 rows that were excluded from the training dataset (coded as -train). In the table we can see the number of correct and incorrect classifications. The error rate was 0.2735125.

Naïve Bayes to categorize high risk or low risk days for Pitney Bowes (PBI)


X.Train = PBIb[train,1:2]
Y.Train = PBIb[train,3]
X.Test = PBIb[-train,1:2]
Y.Test = PBIb[-train,3]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
ygrid = predict(model$finalModel,xgrid)$class
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
     main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")

I ran the Naïve Bayes model with 10-fold cross validation on the training data. I then plotted the xgrid predictions to view the classification boundary, which, unlike the SVM boundary, is not linear.
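Besides hard class labels, klaR's predict method also returns posterior probabilities, which can be useful for inspecting borderline points (a sketch using the objects defined above):

head(predict(model$finalModel, X.Test)$posterior)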

plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
     main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
ColorY = PBIb[train,3]
palette(c("red","green"))
points(PBIb[train,1:2],col=ColorY,pch=19)

Here we can see how the HighRisk and LowRisk points compare to the boundary.

table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      367     158
  LowRisk       129     388
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2754319

With the Naïve Bayes classification boundary the error rate was 0.2754319.

Logistic Regression to categorize high risk or low risk days for Pitney Bowes (PBI)


glm.fit = glm(y ~ Log1 + Log2, data=PBIb[train,], family=binomial)
glm.probs = predict(glm.fit,PBIb[-train,],type="response")
glm.forecast = PBIb[-train,3]
glm.forecast[glm.probs<.5]="HighRisk"
glm.forecast[glm.probs>.5]="LowRisk"
plot(PBIb[-train,1],PBIb[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
     main="Logistic Regression Forecast",ylab="Log of Lag2",xlab="Log of Lag1", pch=20,cex=1)

I fit the logistic regression model on the training data and used the predict function on the test data. Because LowRisk is coded as 1, the fitted probabilities are P(LowRisk), so probabilities below 0.5 map to HighRisk. We can see from the plot that the prediction boundary is linear.
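The boundary is linear because a 0.5 probability cutoff is equivalent to setting the log-odds to zero:

\[\log\frac{p(X)}{1-p(X)} = \beta_0 + \beta_1X_1 + \beta_2X_2 = 0\]

which is again the equation of a line in the Log1-Log2 plane, just with different coefficients than the SVM found.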

table(truth=PBIb[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      378     147
  LowRisk       140     377
confusion.matrix=table(truth=PBIb[-train,3],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2754319

Logistic regression had an error rate of 0.2754319.

Part 2: County Health Data

Introduction to County Health Dataset


I have chosen to study health data from “The County Health Rankings & Roadmaps” program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. The data can be found at the following web address:

County Health Data

Data Elements Code Description
FIPS FIPS Federal Information Processing Standard
State State State
County County County
Years of Potential Life Lost Rate YPLL Age-adjusted YPLL rate per 100,000
% Fair/Poor PFP Percentage of adults that report fair or poor health
Physically Unhealthy Days PUD Average number of reported physically unhealthy days per month
Mentally Unhealthy Days MUD Average number of reported mentally unhealthy days per month
% LBW LBW Percentage of births with low birth weight (<2500g)
% Smokers Smoke Percentage of adults that reported currently smoking
% Obese Obese Percentage of adults that report BMI >= 30
Food Environment Index FEI Indicator of access to healthy foods - 0 is worst, 10 is best
% Physically Inactive PI Percentage of adults that report no leisure-time physical activity
% With Access WA Percentage of the population with access to places for physical activity
% Excessive Drinking ED Percentage of adults that report excessive drinking
% Alcohol-Impaired AI Percentage of driving deaths with alcohol involvement
Chlamydia Rate CR Chlamydia cases per 100,000 population
Teen Birth Rate TB Births per 1,000 females ages 15-19
% Uninsured Unis Percentage of people under age 65 without insurance
PCP Rate PCP Primary Care Physicians per 100,000 population
Dentist Rate Dent Dentists per 100,000 population
MHP Rate MHP Mental Health Providers per 100,000 population
Preventable Hosp. Rate PHR Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Screened PS Percentage of female Medicare enrollees having an annual mammogram (age 65-74)
% Vaccinated PV Percentage of annual Medicare enrollees having an annual flu vaccination
Graduation Rate GR Graduation rate
% Some College SC Percentage of adults age 25-44 with some post-secondary education
% Unemployed Unem Percentage of population ages 16+ unemployed and looking for work
% Children in Poverty CIP Percentage of children (under age 18) living in poverty
Income Ratio IR Ratio of household income at the 80th percentile to income at the 20th percentile
% Single-Parent Households SPH Percentage of children that live in single-parent households
Association Rate AR Associations per 10,000 population
Violent Crime Rate VCR Violent crimes per 100,000 population
Injury Death Rate IDR Injury mortality rate per 100,000
Average Daily PM ADPM Average daily amount of fine particulate matter in micrograms per cubic meter
Presence of violation POV County affected by a water violation: 1-Yes, 0-No
% Severe Housing Problems SHP Percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities
% Drive Alone PDA Percentage of workers who drive alone to work
% Long Commute - Drives Alone PLC Among workers who commute in their car alone, the percentage that commute more than 30 minutes

The dataset has 38 columns and 3,142 rows.

I intend to focus on Years of Potential Life Lost Rate (YPLL) as my dependent variable. According to Wikipedia:

“Years of potential life lost (YPLL) or potential years of life lost (PYLL), is an estimate of the average years a person would have lived if he or she had not died prematurely.[1] It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people. An alternative is to consider the effects of both disability and premature death using disability adjusted life years.”

It is calculated by subtracting the person’s age at death from a reference age, like 75. A death at 60 would result in 15 years of potential lost life. Negative numbers are not included, so a death at 80 is calculated as 0.
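In R this is just a clipped difference from the reference age (illustrative ages, not values from the dataset):

ages = c(60, 80)
pmax(0, 75 - ages)
[1] 15  0

The death at 60 contributes 15 years; the death at 80 contributes nothing.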

I will categorize a value above the column median as High Risk and a value below the median as Low Risk, to signify the relative risk of dying early for a person in that county.

CHD <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/CHD.csv")
FIPS State County YPLL PFP PUD MUD LBW Smoke Obese FEI PI WA
1001 Alabama Autauga 8824 18 4.2 4.3 8 19 38 7.2 31 69
ED AI CR TB Unis PCP Dent MHP PHR PS PV GR SC
17 29 341.2 27 9 42 32 16 6599 44 41 90 61
Unem CIP IR SPH AR VCR IDR ADPM POV SHP PDA PLC
3.9 19 4.6 25 12.6 272 74 11.7 No 15 86 38
pairs(CHD[,4:12])

pairs(CHD[,c(4,13:20)])

pairs(CHD[,c(4,21:28)])

pairs(CHD[,c(4,29:36)])

We can see from the scatterplots that YPLL tends to be correlated with health measures like smoking, obesity, injuries, etc.
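One way to quantify what the scatterplots suggest is to compute the correlations directly (a sketch; the column names follow the data dictionary above):

round(cor(CHD$YPLL, CHD[,c("Smoke","Obese","PI","IDR")], use="pairwise.complete.obs"), 2)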

SVM to Categorize High Risk or Low Risk YPLL for US Counties


I began the analysis with two x-variables, Smoking and Obesity. I took the log of both and added HighRisk/LowRisk as a factor.

CHD2 = CHD[,c(4,9,10)]
CHD2 <- na.omit(CHD2)
a = median(CHD2$YPLL)
CHD2$Risk = ifelse(CHD2$YPLL<=a,"LowRisk","HighRisk")
CHD2$LogSmoke = log(CHD2$Smoke)
CHD2$LogObese = log(CHD2$Obese)
y=as.factor(CHD2$Risk)
CHD2 = data.frame(CHD2[,5:6],y)
kable(CHD2[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive",full_width=F, position = "left")
LogSmoke LogObese y
2.94 3.64 HighRisk
2.83 3.43 LowRisk
3.09 3.78 HighRisk
3.00 3.64 HighRisk
3.00 3.53 HighRisk
3.14 3.66 HighRisk
set.seed(12345)
train = sample(2908,1500)
tune.out=tune(svm,y~.,data=CHD2[train,],kernel="linear",
              ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
 0.01

- best performance: 0.2526667 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.2600000 0.03527668
2 1e-02 0.2526667 0.02693385
3 1e-01 0.2546667 0.02606734
4 1e+00 0.2560000 0.02688797
5 5e+00 0.2560000 0.02688797
6 1e+01 0.2560000 0.02688797
7 1e+02 0.2560000 0.02688797
8 1e+03 0.2560000 0.02688797

I then took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.01.

svmfit=svm(y~.,data=CHD2[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
summary(svmfit)

Call:
svm(formula = y ~ ., data = CHD2[train, ], kernel = "linear", 
    cost = tune.out$best.parameters[1, 1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.01 
      gamma:  0.5 

Number of Support Vectors:  1468

 ( 734 734 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 1468 support vectors, with 734 from one class and 734 from the other.

x=as.matrix(CHD2[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("LogSmoke","LogObese")
ygrid = predict(svmfit, xgrid)
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
     main="SVM classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
ColorY=CHD2[train,3]
palette(c("red","green"))
points(CHD2[train,1:2],col=ColorY,pch=19)
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

We can see from the plot that the hyperplane splits the data about where we would expect, with most of the HighRisk points clustered in the upper right. It is no surprise that higher smoking and obesity rates go along with more years of potential life lost.

table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      516     175
  LowRisk       175     542
confusion.matrix=table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2485795

I then used the predict function with the remaining rows of data (the test data, coded as -train) and used the table function to evaluate the outcome. The error rate was 0.2485795.

Next, I expanded the analysis to 10 x-variables, as shown below.

Data Elements Code Description
% Fair/Poor PFP Percentage of adults that report fair or poor health
Physically Unhealthy Days PUD Average number of reported physically unhealthy days per month
Mentally Unhealthy Days MUD Average number of reported mentally unhealthy days per month
% Smokers Smoke Percentage of adults that reported currently smoking
% Obese Obese Percentage of adults that report BMI >= 30
% Physically Inactive PI Percentage of adults that report no leisure-time physical activity
Chlamydia Rate CR Chlamydia cases per 100,000 population
Preventable Hosp. Rate PHR Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Children in Poverty CIP Percentage of children (under age 18) living in poverty
Injury Death Rate IDR Injury mortality rate per 100,000
CHD3 = CHD[,c(4,9,10,5,6,7,12,16,33,22,28)]
CHD3 <- na.omit(CHD3)
CHD3[,2:11] <-log(CHD3[,2:11])
kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
YPLL Smoke Obese PFP PUD MUD PI CR IDR PHR CIP
8824 2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94
7225 2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71
a=median(CHD3$YPLL)
CHD3$Risk = ifelse(CHD3$YPLL<=a,"LowRisk","HighRisk")
y=as.factor(CHD3$Risk)
CHD3 = data.frame(CHD3[,2:11],y)
kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Smoke Obese PFP PUD MUD PI CR IDR PHR CIP y
2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94 HighRisk
2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71 LowRisk
set.seed(12345)
train = sample(2875,1500)
tune.out=tune(svm,y~.,data=CHD3[train,],kernel="linear",
              ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.1153333 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.1393333 0.03332593
2 1e-02 0.1173333 0.02518083
3 1e-01 0.1153333 0.02352829
4 1e+00 0.1180000 0.02155670
5 5e+00 0.1180000 0.02352829
6 1e+01 0.1180000 0.02352829
7 1e+02 0.1180000 0.02352829
8 1e+03 0.1180000 0.02352829

I took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.1.

svmfit=svm(y~.,data=CHD3[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
summary(svmfit)

Call:
svm(formula = y ~ ., data = CHD3[train, ], kernel = "linear", 
    cost = tune.out$best.parameters[1, 1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.1 
      gamma:  0.1 

Number of Support Vectors:  623

 ( 312 311 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 623 support vectors, with 312 from one class and 311 from the other.

table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      589      84
  LowRisk        77     625
confusion.matrix=table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1170909

I then used the predict function with the remaining rows of data and used the table function to evaluate the outcome. The error rate was an impressive 0.1170909.

Naïve Bayes to Categorize High Risk or Low Risk YPLL for US Counties


set.seed(1234)
train = sample(2908,1500)
X.Train = CHD2[train,1:2]
Y.Train = CHD2[train,3]
X.Test = CHD2[-train,1:2]
Y.Test = CHD2[-train,3]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
x=as.matrix(CHD2[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("LogSmoke","LogObese")
ygrid = predict(model$finalModel,xgrid)$class
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
     main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")

Here we can see the non-linear classification boundary for the two logged variables, smoking and obesity.

plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
     main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
ColorY = CHD2[train,3]
palette(c("red","green"))
points(CHD2[train,1:2],col=ColorY,pch=19)

When the points are plotted along with the boundary we can see how well it performs.

table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      513     166
  LowRisk       181     548
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2464489

With the Naïve Bayes classification boundary the error rate was 0.2464489.

X.Train = CHD3[train,1:10]
Y.Train = CHD3[train,11]
X.Test = CHD3[-train,1:10]
Y.Test = CHD3[-train,11]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      568     126
  LowRisk        97     600
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1603163

I then used cross validation with the 10 x-variables, and the best model produced an error rate of 0.1603163.

Logistic Regression to Categorize High Risk or Low Risk YPLL for US Counties


glm.fit = glm(y ~ LogSmoke + LogObese, data=CHD2[train,], family=binomial)
glm.probs = predict(glm.fit,CHD2[-train,],type="response")
glm.forecast = CHD2[-train,3]
glm.forecast[glm.probs<.5]="HighRisk"
glm.forecast[glm.probs>.5]="LowRisk"
plot(CHD2[-train,1],CHD2[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
     main="Logistic Regression Forecast",ylab="Log of Obese",xlab="Log of Smoke", pch=20,cex=1)

I fit the logistic regression model on the training data and used the predict function on the test data. We can see from the plot that the prediction boundary is linear.

table(truth=CHD2[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      513     166
  LowRisk       188     541
confusion.matrix=table(truth=CHD2[-train,3],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2514205

With the logistic regression classification boundary the error rate was 0.2514205.

CHD4 <- CHD3
CHD4$y = ifelse(CHD4$y=="LowRisk",1,0)
kable(CHD4[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Smoke Obese PFP PUD MUD PI CR IDR PHR CIP y
2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94 0
2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71 1
3.09 3.78 3.26 1.63 1.53 3.33 6.32 4.29 8.46 3.91 0
3.00 3.64 3.00 1.48 1.46 3.56 5.71 4.61 8.70 3.30 0
3.00 3.53 3.04 1.50 1.55 3.37 4.74 4.65 8.33 2.94 0
3.14 3.66 3.37 1.65 1.57 3.37 6.60 4.70 8.82 3.87 0
CV2.out=bestglm(CHD4, IC="BICq",family=binomial, TopModels=1)
Morgan-Tatar search since family is non-gaussian.
CV2.out
BICq(q = 0.25)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11
CV3.out=bestglm(CHD4, IC="CV",family=binomial, TopModels=1)
Morgan-Tatar search since family is non-gaussian.
CV3.out
CVd(d = 2463, REP = 1000)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11

I expanded the analysis to 10 x-variables and used both cross-validation and BICq to select the best model. Both produced the same seven-predictor model.
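bestglm also stores the refitted glm directly, so the manual refit below is equivalent (a sketch using the standard accessor):

summary(CV2.out$BestModel)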

glm.fit = glm(y ~ Smoke + PFP + PI + CR + IDR + PHR + CIP, data=CHD4, family=binomial)
glm.probs = predict(glm.fit,type="response")
glm.forecast = ifelse(glm.probs<.5,"HighRisk","LowRisk")
table(truth=CHD4[,11],prediction=glm.forecast)
     prediction
truth HighRisk LowRisk
    0     1266     171
    1      165    1273
confusion.matrix=table(truth=CHD4[,11],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1168696

The best model produced an error rate of 0.1168696.

Results Summary

PBI Risk

Model                 Error Rate
SVM                   0.2735125
Naive Bayes           0.2754319
Logistic Regression   0.2754319

County Health Risk - 2 variables

Model                 Error Rate
SVM                   0.2485795
Naive Bayes           0.2464489
Logistic Regression   0.2514205

County Health Risk - 10 variables

Model                 Error Rate
SVM                   0.1170909
Naive Bayes           0.1603163
Logistic Regression   0.1168696