Formatting has been updated. See the original report here:

https://rpubs.com/pjozefek/573317

Introduction to Support Vector Machines


Support Vector classification attempts to construct a hyperplane that will correctly segregate the data. In two dimensions the hyperplane is defined by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

and in p-dimensions by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p =0\]

If the equation is >0 then X lies on one side of the hyperplane and if it is <0 then X lies on the other side.

For linear classification the model finds the maximal margin hyperplane. The margin is the distance from the training observations to the separating hyperplane (the model maximizes this distance). This can be visualized in the graphic below.

The two classes are shown in red and green. The maximal margin hyperplane is shown as a solid line and the margin is the distance from the solid line to either of the dashed lines. The points that lie on the dashed line are support vectors. The red and green grid represents the decision rule made by the classifier.

We can see that there is one green point on the red side (misclassified). This was allowed by the model because I used a tuning parameter, C, which determines the number of points that can violate the margin. As C increases the margin will widen and we become more tolerant of violations. If C is large than the classifier will have low variance and high bias, but if C is small than the classifier will have high variance and low bias.

R Code for Sample Hyperplane


> library(e1071)
> library(class)
> library(caret)
> library(klaR)
> library(bestglm)
> 
> set.seed(1)
> x=matrix(rnorm(20*2),ncol = 2)
> y=c(rep(-1,10),rep(1,10))
> x[y==1,]=x[y==1,]+1
> dat=data.frame(x=x,y=as.factor(y))
> svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)
  1. I generated 40 random values from a normal distribution (2 columns of 20)
  2. The y variable was added as 10 rows of -1 and 10 rows of 1
  3. Where y=1 the x values were increased by 1
  4. The svm function from the “e1071” and “class” packages was used to calculate the model
> make.grid = function(x, n=75) {
+   grange = apply(x,2,range)
+   x1 = seq(from = grange[1,1], to=grange[2,1],length=n)
+   x2 = seq(from = grange[1,2], to=grange[2,2], length=n)
+   expand.grid(X1 = x1, X2 = x2)
+ }

This function requires a matrix as the x input. It then takes the range of the column values and generates a sequence of 75 values from the low to the high. Finally, expand.grid generates every combination of the x1 and x2 sequence values.

> xgrid = make.grid(x)
> colnames(xgrid) <- c("x.1","x.2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid, col=c("red","green")[as.numeric(ygrid)],pch = ".", cex=1, main="Sample Hyperplane")
> palette(c("red","green","green"))
> points(x,col=(2+y),pch=19)  

I made a grid with the x matrix and used the predict function to generate a prediction for each point. I then plotted the grid with the x values as points.

> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

Unfortunately it is a bit difficult to extract the beta coefficients from the svm function, but the formulas above retrieve the beta values. We can then use the hyperplane equation

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

to solve for the slope and intercept of the decision boundary (plotted with abline).

Part 1: Pitney Bowes (PBI) data


SVM to categorize high risk or low risk days for Pitney Bowes (PBI)


> PBI <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/PBI.csv")
> library(knitr)
> library(kableExtra)
> kable(PBI[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Date Open High Low Close Adj.Close Volume PBIret
1/3/2007 46.40 46.49 45.87 46.37 23.43 964700 NA
1/4/2007 46.35 46.52 45.89 46.32 23.40 1068400 -0.11
1/5/2007 46.34 46.49 46.19 46.40 23.44 728900 0.17
1/8/2007 46.35 47.06 46.25 46.86 23.67 1054000 0.99
1/9/2007 47.15 47.55 47.06 47.39 23.94 1176800 1.13
1/10/2007 47.24 47.95 46.91 47.70 24.10 1798700 0.65

I began by importing some basic stock data for PBI, from 1/3/07 to 2/5/19.

> #Create Range and Risk columns
> PBI$Range = PBI$High - PBI$Low
> a = median(PBI$Range)
> PBI$Risk = ifelse(PBI$Range<=a,"LowRisk","HighRisk")
> 
> #Shift Range down 1 to create Lag1
> PBI$Lag1 = c(0,PBI[-nrow(PBI),9])
> #Shift Range down 1 to create Lag2
> PBI$Lag2 = c(0,PBI[-nrow(PBI),11])
> 
> #Create Logs
> PBI$Log1 = log(PBI$Lag1)
> PBI$Log2 = log(PBI$Lag2)
> 
> #Seperate 3 columns and delete top 2 rows
> PBIa = as.data.frame(PBI[3:nrow(PBI),c(13,14,10)])
> kable(PBIa[1:6,],digits=2)%>% kable_styling(bootstrap_options = "responsive")
Log1 Log2 Risk
3 -0.46 -0.48 LowRisk
4 -1.20 -0.46 HighRisk
5 -0.21 -1.20 HighRisk
6 -0.71 -0.21 HighRisk
7 0.04 -0.71 HighRisk
8 -0.80 0.04 LowRisk

If the range of intra-day prices was below the median of all the days it was labeled LowRisk and if it was above it was labeled HighRisk. I then created a 1 day and 2 day lag of the range data and took the log of each.

> y=as.factor(PBIa$Risk)
> kable(contrasts(y),col.names = NULL)%>% kable_styling(full_width=F, position = "left")
HighRisk 0
LowRisk 1

As a factor, HighRisk is a 0 and LowRisk is a 1

> palette(c("red","green"))
> plot(PBIa$Log1,PBIa$Log2,col=y,main="Plot of Log Range",
+      ylab="Log of Lag2",xlab="Log of Lag1")

We can see that LowRisk days are clustered on the lower left and HighRisk on the upper right.

> PBIb = data.frame(PBIa[,1:2],y)
> set.seed(1234)
> train = sample(3042,2000)

I needed to create a new data frame with y as a factor (as needed by the svm function) and I took a random sample of 2000 from the total of 3042 for my training data.

> svmfit=svm(y~.,data=PBIb[train,],kernel="linear",cost=10,scale=FALSE)
> summary(svmfit)

Call:
svm(formula = y ~ ., data = PBIb[train, ], kernel = "linear", cost = 10, 
    scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 

Number of Support Vectors:  1208

 ( 604 604 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I used the svm function with the training data, a linear kernel, a cost of 10 (the tuning parameter) and I did not scale the data. We can see from the summary that there are 1208 support vectors, with 604 from one class and 604 from the other)

> tune.out=tune(svm,y~.,data=PBIb[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.262 

- Detailed performance results:
   cost  error dispersion
1 1e-03 0.2660 0.03526093
2 1e-02 0.2655 0.03475709
3 1e-01 0.2620 0.03224903
4 1e+00 0.2630 0.03351617
5 5e+00 0.2630 0.03417276
6 1e+01 0.2625 0.03417683
7 1e+02 0.2625 0.03417683
8 1e+03 0.2645 0.03459688

I then used the tune function to find the best model (uses cross validation and different tuning parameters). It found that a tuning parameter of 0.1 produced the best model.

> svmfit=svm(y~.,data = PBIb[train,],kernel="linear",cost=tune.out$best.parameters,scale=FALSE)
> x=as.matrix(PBIb[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("Log1","Log2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="SVM classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY=PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

I ran the best model (cost = 0.1) and plotted the results. We can see that while the hyperplane does not perfectly split the data, which would be impossible, it does seems to split it at the ideal location. Red represents HighRisk and green represents LowRisk.

> table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      333     181
  LowRisk       112     416
> confusion.matrix=table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.28119

We can now use the predict function with best model to predict HighRisk and LowRisk days. Our test data will be the 1042 rows that were excluded from the training dataset (coded as -train). When viewed in the table format we can see the number of correct and incorrect classifications. The error rate was 0.28119

Naive Bayes to categorize high risk or low risk days for Pitney Bowes (PBI)


> X.Train = PBIb[train,1:2]
> Y.Train = PBIb[train,3]
> X.Test = PBIb[-train,1:2]
> Y.Test = PBIb[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+      main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")

I ran the Naive Bayes model with cross validation (using the training data). I then plotted the xgrid predictions to view the classification boundary. We can see that it is not linear, as it was with SVM.

> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY = PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)

Here we can see how the HighRisk and LowRisk points compare to the boundary.

> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      327     187
  LowRisk       109     419
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2840691

With the Naive Bayes classification boundary the error rate was 0.2840691

Logistic Regression to categorize high risk or low risk days for Pitney Bowes (PBI)


> glm.fit = glm(y ~ Log1 + Log2, data=PBIb[train,], family=binomial)
> glm.probs = predict(glm.fit,PBIb[-train,],type="response")
> glm.forecast = PBIb[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(PBIb[-train,1],PBIb[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+      main="Logistic Regression Forecast",ylab="Log of Lag2",xlab="Log of Lag1", pch=20,cex=1)

I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.

> table(truth=PBIb[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      344     170
  LowRisk       125     403
> confusion.matrix=table(truth=PBIb[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2831094

Logistic regression had an error rate of 0.2831094

Part 2: County Health Data


Introduction to County Health Dataset


I have chosen to study health data from “The County Health Rankings & Roadmaps” program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. The data can be found at the following web address:

County Health Data

Data Elements Code Description
FIPS FIPS Federal Information Processing Standard
State State State
County County County
Years of Potential Life Lost Rate YPLL Age-adjusted YPLL rate per 100,000
% Fair/Poor PFP Percentage of adults that report fair or poor health
Physically Unhealthy Days PUD Average number of reported physically unhealthy days per month
Mentally Unhealthy Days MUD Average number of reported mentally unhealthy days per month
% LBW LBW Percentage of births with low birth weight (<2500g)
% Smokers Smoke Percentage of adults that reported currently smoking
% Obese Obese Percentage of adults that report BMI >= 30
Food Environment Index FEI Indicator of access to healthy foods - 0 is worst, 10 is best
% Physically Inactive PI Percentage of adults that report no leisure-time physical activity
% With Access WA Percentage of the population with access to places for physical activity
% Excessive Drinking ED Percentage of adults that report excessive drinking
Impaired AI Percentage of driving deaths with alcohol involvement
Chlamydia Rate CR Chlamydia cases per 100,000 population
Teen Birth Rate TB Births per 1,000 females ages 15-19
% Uninsured Unis Percentage of people under age 65 without insurance
PCP Rate PCP Primary Care Physicians per 100,000 population
Dentist Rate Dent Dentists per 100,000 population
MHP Rate MHP Mental Health Providers per 100,000 population
Preventable Hosp. Rate PHR Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Screened PS Percentage of female Medicare enrollees having an annual mammogram (age 65-74)
% Vaccinated PV Percentage of annual Medicare enrollees having an annual flu vaccination
Graduation Rate GR Graduation rate
% Some College SC Percentage of adults age 25-44 with some post-secondary education
% Unemployed Unem Percentage of population ages 16+ unemployed and looking for work
% Children in Poverty CIP Percentage of children (under age 18) living in poverty
Income Ratio IR Ratio of household income at the 80th percentile to income at the 20th percentile
% Single-Parent Households SPH Percentage of children that live in single-parent households
Association Rate AR Associations per 10,000 population
Violent Crime Rate VCR Violent crimes per 100,000 population
Injury Death Rate IDR Injury mortality rate per 100,000
Average Daily PM ADPM Average daily amount of fine particulate matter in micrograms per cubic meter
Presence of violation POV County affected by a water violation: 1-Yes, 0-No
% Severe Housing Problems SHP Percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities
% Drive Alone PDA Percentage of workers who drive alone to work
% Long Commute - Drives Alone PLC Among workers who commute in their car alone, the percentage that commute more than 30 minutes

The dataset has 38 columns and 3,142 rows.

I intend to focus on Years of Potential Life Lost Rate (YPLL) as my dependent variable. According to Wikipedia

“Years of potential life lost (YPLL) or potential years of life lost (PYLL), is an estimate of the average years a person would have lived if he or she had not died prematurely.[1] It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people. An alternative is to consider the effects of both disability and premature death using disability adjusted life years.”

It is calculated by subtracting the person’s age at death from a reference age, like 75. A death at 60 would result in 15 years of potential lost life. Negative numbers are not included, so a death at 80 is calculated as 0.

I will categorize a value above the column median as High Risk and a value below the median as Low Risk, to signify the relative risk of dying early for a person in that county.

> CHD <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/CHD.csv")
FIPS State County YPLL PFP PUD MUD LBW Smoke Obese FEI PI WA
1001 Alabama Autauga 8824 18 4.2 4.3 8 19 38 7.2 31 69
ED AI CR TB Unis PCP Dent MHP PHR PS PV GR SC
17 29 341.2 27 9 42 32 16 6599 44 41 90 61
Unem CIP IR SPH AR VCR IDR ADPM POV SHP PDA PLC
3.9 19 4.6 25 12.6 272 74 11.7 No 15 86 38
> pairs(CHD[,4:12])

> pairs(CHD[,c(4,13:20)])

> pairs(CHD[,c(4,21:28)])

> pairs(CHD[,c(4,29:34,36)])

We can see from the scatterplots that YPLL tends to be correlated with healthiness measures like smoking, obesity, injuries, etc.

SVM to Categorize High Risk or Low Risk YPLL for US Counties


I began the analysis by starting with two x-variables, Smoking and Obesity. I logged both and added HighRisk/LowRisk as a factor.

> CHD2 = CHD[,c(4,9,10)]
> CHD2 <- na.omit(CHD2)
> a = median(CHD2$YPLL)
> CHD2$Risk = ifelse(CHD2$YPLL<=a,"LowRisk","HighRisk")
> CHD2$LogSmoke = log(CHD2$Smoke)
> CHD2$LogObese = log(CHD2$Obese)
> y=as.factor(CHD2$Risk)
> CHD2 = data.frame(CHD2[,5:6],y)
> kable(CHD2[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive",full_width=F, position = "left")
LogSmoke LogObese y
2.94 3.64 HighRisk
2.83 3.43 LowRisk
3.09 3.78 HighRisk
3.00 3.64 HighRisk
3.00 3.53 HighRisk
3.14 3.66 HighRisk
> set.seed(12345)
> train = sample(2908,1500)
> tune.out=tune(svm,y~.,data=CHD2[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.2466667 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.2560000 0.02458545
2 1e-02 0.2493333 0.02270585
3 1e-01 0.2466667 0.02266231
4 1e+00 0.2473333 0.02382032
5 5e+00 0.2473333 0.02382032
6 1e+01 0.2473333 0.02382032
7 1e+02 0.2473333 0.02382032
8 1e+03 0.2473333 0.02382032

I then took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.1.

> svmfit=svm(y~.,data=CHD2[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)

Call:
svm(formula = y ~ ., data = CHD2[train, ], kernel = "linear", cost = tune.out$best.parameters[1, 
    1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.1 

Number of Support Vectors:  1078

 ( 539 539 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 1078 support vectors, with 539 from one class and 539 from the other.

> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="SVM classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY=CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

We can see from the plot that the hyperplane has split the data at the ideal location, with most of the High Risk datapoints clustered in the upper right. It’s no surprise that a higher percent of smoking and obesity leads to higher years of lost life.

> table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      510     198
  LowRisk       156     544
> confusion.matrix=table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2514205

I then used the predict function with the remaining rows of data (the test data, coded as -train) and used the table function to evaluate the outcome. The error rate was 0.2514205

Next, I expanded the analysis to 10 x-variables, as shown below.

Data Elements Code Description
% Fair/Poor PFP Percentage of adults that report fair or poor health
Physically Unhealthy Days PUD Average number of reported physically unhealthy days per month
Mentally Unhealthy Days MUD Average number of reported mentally unhealthy days per month
% Smokers Smoke Percentage of adults that reported currently smoking
% Obese Obese Percentage of adults that report BMI >= 30
% Physically Inactive PI Percentage of adults that report no leisure-time physical activity
Chlamydia Rate CR Chlamydia cases per 100,000 population
Preventable Hosp. Rate PHR Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Children in Poverty CIP Percentage of children (under age 18) living in poverty
Injury Death Rate IDR Injury mortality rate per 100,000
> CHD3 = CHD[,c(4,9,10,5,6,7,12,16,33,22,28)]
> CHD3 <- na.omit(CHD3)
> CHD3[,2:11] <-log(CHD3[,2:11])
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
YPLL Smoke Obese PFP PUD MUD PI CR IDR PHR CIP
8824 2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94
7225 2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71
> a=median(CHD3$YPLL)
> CHD3$Risk = ifelse(CHD3$YPLL<=a,"LowRisk","HighRisk")
> y=as.factor(CHD3$Risk)
> CHD3 = data.frame(CHD3[,2:11],y)
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Smoke Obese PFP PUD MUD PI CR IDR PHR CIP y
2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94 HighRisk
2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71 LowRisk
> set.seed(12345)
> train = sample(2875,1500)
> tune.out=tune(svm,y~.,data=CHD3[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
 1000

- best performance: 0.1093333 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.1346667 0.02283597
2 1e-02 0.1113333 0.02898275
3 1e-01 0.1133333 0.03370167
4 1e+00 0.1100000 0.03488960
5 5e+00 0.1100000 0.03488960
6 1e+01 0.1100000 0.03488960
7 1e+02 0.1100000 0.03517154
8 1e+03 0.1093333 0.03431157

I took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 1000.

> svmfit=svm(y~.,data=CHD3[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)

Call:
svm(formula = y ~ ., data = CHD3[train, ], kernel = "linear", cost = tune.out$best.parameters[1, 
    1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1000 

Number of Support Vectors:  413

 ( 206 207 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 413 support vectors, with 206 from one class and 207 from the other.

> table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      591      88
  LowRisk        80     616
> confusion.matrix=table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1221818

I then used the predict function with the remaining rows of data and used the table function to evaluate the outcome. The error rate was an impressive 0.1221818.

Naive Bayes to Categorize High Risk or Low Risk YPLL for US Counties


> set.seed(1234)
> train = sample(2908,1500)
> X.Train = CHD2[train,1:2]
> Y.Train = CHD2[train,3]
> X.Test = CHD2[-train,1:2]
> Y.Test = CHD2[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+      main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")

Here we can see the non-linear classification boundary for the two logged variables, smoking and obesity.

> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY = CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)

When the points are plotted along with the boundary we can see how well it performs

> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      529     172
  LowRisk       197     510
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2620739

With the Naive Bayes classification boundary the error rate was 0.2620739.

> X.Train = CHD3[train,1:10]
> Y.Train = CHD3[train,11]
> X.Test = CHD3[-train,1:10]
> Y.Test = CHD3[-train,11]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      595     113
  LowRisk        91     590
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1468683

I then used cross validation with the 10 x-variables, and the best model produced an error rate of 0.1468683

Logistic Regression to Categorize High Risk or Low Risk YPLL for US Counties


> glm.fit = glm(y ~ LogSmoke + LogObese, data=CHD2[train,], family=binomial)
> glm.probs = predict(glm.fit,CHD2[-train,],type="response")
> glm.forecast = CHD2[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(CHD2[-train,1],CHD2[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+      main="Logistic Regression Forecast",ylab="Log of Obese",xlab="Log of Smoke", pch=20,cex=1)

I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.

> table(truth=CHD2[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      531     170
  LowRisk       200     507
> confusion.matrix=table(truth=CHD2[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2627841

With the Logistic Regression classification boundary the error rate was 0.2627841

> CHD4 <- CHD3
> CHD4$y = ifelse(CHD4$y=="LowRisk",1,0)
> kable(CHD4[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
Smoke Obese PFP PUD MUD PI CR IDR PHR CIP y
2.94 3.64 2.89 1.44 1.46 3.43 5.83 4.30 8.79 2.94 0
2.83 3.43 2.89 1.41 1.44 3.18 5.83 4.23 8.25 2.71 1
3.09 3.78 3.26 1.63 1.53 3.33 6.32 4.29 8.46 3.91 0
3.00 3.64 3.00 1.48 1.46 3.56 5.71 4.61 8.70 3.30 0
3.00 3.53 3.04 1.50 1.55 3.37 4.74 4.65 8.33 2.94 0
3.14 3.66 3.37 1.65 1.57 3.37 6.60 4.70 8.82 3.87 0
> CV2.out=bestglm(CHD4, IC="BICq",family=binomial, TopModels=1)
> CV2.out
BICq(q = 0.25)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11
> CV3.out=bestglm(CHD4, IC="CV",family=binomial, TopModels=1)
> CV3.out
CVd(d = 2463, REP = 1000)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11

I expanded the analysis to 10 x-variables and used both Cross Validation and BIC to find the best model. Both produced the same results.

> glm.fit = glm(y ~ Smoke + PFP + PI + CR + IDR + PHR + CIP, data=CHD4, family=binomial)
> glm.probs = predict(glm.fit,type="response")
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> table(truth=CHD4[,11],prediction=glm.forecast)
     prediction
truth HighRisk LowRisk
    0     1266     171
    1      165    1273
> confusion.matrix=table(truth=CHD4[,11],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1168696

The ideal model produced an error rate of 0.1168696

Results Summary


PBI Risk

Model Error Rate
SVM 0.28119
Naive Bayes 0.2840691
Logistic Regression 0.2831094

County Health Risk - 2 variables

Model Error Rate
SVM 0.2514205
Naive Bayes 0.2620739
Logistic Regression 0.2627841

County Health Risk - 10 variables

Model Error Rate
SVM 0.1221818
Naive Bayes 0.1468683
Logistic Regression 0.1168696