Formatting has been updated. See the original report here:

Introduction to Support Vector Machines

Support Vector classification attempts to construct a hyperplane that will correctly segregate the data. In two dimensions the hyperplane is defined by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

and in p-dimensions by:

\[\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p =0\]

If the equation is >0 then X lies on one side of the hyperplane and if it is <0 then X lies on the other side.

For linear classification the model finds the maximal margin hyperplane. The margin is the distance from the training observations to the separating hyperplane (the model maximizes this distance). This can be visualized in the graphic below.

The two classes are shown in red and green. The maximal margin hyperplane is shown as a solid line and the margin is the distance from the solid line to either of the dashed lines. The points that lie on the dashed line are support vectors. The red and green grid represents the decision rule made by the classifier.

We can see that there is one green point on the red side (misclassified). This was allowed by the model because I used a tuning parameter, C, which determines the number of points that can violate the margin. As C increases the margin will widen and we become more tolerant of violations. If C is large than the classifier will have low variance and high bias, but if C is small than the classifier will have high variance and low bias.

R Code for Sample Hyperplane

> library(e1071)
> library(class)
> library(caret)
> library(klaR)
> library(bestglm)
> 
> set.seed(1)
> x=matrix(rnorm(20*2),ncol = 2)
> y=c(rep(-1,10),rep(1,10))
> x[y==1,]=x[y==1,]+1
> dat=data.frame(x=x,y=as.factor(y))
> svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)

I generated 40 random values from a normal distribution (2 columns of 20)
The y variable was added as 10 rows of -1 and 10 rows of 1
Where y=1 the x values were increased by 1
The svm function from the “e1071” and “class” packages was used to calculate the model

> make.grid = function(x, n=75) {
+   grange = apply(x,2,range)
+   x1 = seq(from = grange[1,1], to=grange[2,1],length=n)
+   x2 = seq(from = grange[1,2], to=grange[2,2], length=n)
+   expand.grid(X1 = x1, X2 = x2)
+ }

This function requires a matrix as the x input. It then takes the range of the column values and generates a sequence of 75 values from the low to the high. Finally, expand.grid generates every combination of the x1 and x2 sequence values.

> xgrid = make.grid(x)
> colnames(xgrid) <- c("x.1","x.2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid, col=c("red","green")[as.numeric(ygrid)],pch = ".", cex=1, main="Sample Hyperplane")
> palette(c("red","green","green"))
> points(x,col=(2+y),pch=19)

I made a grid with the x matrix and used the predict function to generate a prediction for each point. I then plotted the grid with the x values as points.

> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

Unfortunately it is a bit difficult to extract the beta coefficients from the svm function, but the formulas above retrieve the beta values. We can then use the hyperplane equation

\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]

to solve for the slope and intercept of the decision boundary (plotted with abline).

Part 1: Pitney Bowes (PBI) data

SVM to categorize high risk or low risk days for Pitney Bowes (PBI)

> PBI <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/PBI.csv")
> library(knitr)
> library(kableExtra)
> kable(PBI[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

Date	Open	High	Low	Close	Adj.Close	Volume	PBIret
1/3/2007	46.40	46.49	45.87	46.37	23.43	964700	NA
1/4/2007	46.35	46.52	45.89	46.32	23.40	1068400	-0.11
1/5/2007	46.34	46.49	46.19	46.40	23.44	728900	0.17
1/8/2007	46.35	47.06	46.25	46.86	23.67	1054000	0.99
1/9/2007	47.15	47.55	47.06	47.39	23.94	1176800	1.13
1/10/2007	47.24	47.95	46.91	47.70	24.10	1798700	0.65

I began by importing some basic stock data for PBI, from 1/3/07 to 2/5/19.

> #Create Range and Risk columns
> PBI$Range = PBI$High - PBI$Low
> a = median(PBI$Range)
> PBI$Risk = ifelse(PBI$Range<=a,"LowRisk","HighRisk")
> 
> #Shift Range down 1 to create Lag1
> PBI$Lag1 = c(0,PBI[-nrow(PBI),9])
> #Shift Range down 1 to create Lag2
> PBI$Lag2 = c(0,PBI[-nrow(PBI),11])
> 
> #Create Logs
> PBI$Log1 = log(PBI$Lag1)
> PBI$Log2 = log(PBI$Lag2)
> 
> #Seperate 3 columns and delete top 2 rows
> PBIa = as.data.frame(PBI[3:nrow(PBI),c(13,14,10)])
> kable(PBIa[1:6,],digits=2)%>% kable_styling(bootstrap_options = "responsive")

	Log1	Log2	Risk
3	-0.46	-0.48	LowRisk
4	-1.20	-0.46	HighRisk
5	-0.21	-1.20	HighRisk
6	-0.71	-0.21	HighRisk
7	0.04	-0.71	HighRisk
8	-0.80	0.04	LowRisk

If the range of intra-day prices was below the median of all the days it was labeled LowRisk and if it was above it was labeled HighRisk. I then created a 1 day and 2 day lag of the range data and took the log of each.

> y=as.factor(PBIa$Risk)
> kable(contrasts(y),col.names = NULL)%>% kable_styling(full_width=F, position = "left")

HighRisk	0
LowRisk	1

As a factor, HighRisk is a 0 and LowRisk is a 1

> palette(c("red","green"))
> plot(PBIa$Log1,PBIa$Log2,col=y,main="Plot of Log Range",
+      ylab="Log of Lag2",xlab="Log of Lag1")

We can see that LowRisk days are clustered on the lower left and HighRisk on the upper right.

> PBIb = data.frame(PBIa[,1:2],y)
> set.seed(1234)
> train = sample(3042,2000)

I needed to create a new data frame with y as a factor (as needed by the svm function) and I took a random sample of 2000 from the total of 3042 for my training data.

> svmfit=svm(y~.,data=PBIb[train,],kernel="linear",cost=10,scale=FALSE)
> summary(svmfit)


Call:
svm(formula = y ~ ., data = PBIb[train, ], kernel = "linear", cost = 10, 
    scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 

Number of Support Vectors:  1208

 ( 604 604 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I used the svm function with the training data, a linear kernel, a cost of 10 (the tuning parameter) and I did not scale the data. We can see from the summary that there are 1208 support vectors, with 604 from one class and 604 from the other)

> tune.out=tune(svm,y~.,data=PBIb[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.262 

- Detailed performance results:
   cost  error dispersion
1 1e-03 0.2660 0.03526093
2 1e-02 0.2655 0.03475709
3 1e-01 0.2620 0.03224903
4 1e+00 0.2630 0.03351617
5 5e+00 0.2630 0.03417276
6 1e+01 0.2625 0.03417683
7 1e+02 0.2625 0.03417683
8 1e+03 0.2645 0.03459688

I then used the tune function to find the best model (uses cross validation and different tuning parameters). It found that a tuning parameter of 0.1 produced the best model.

> svmfit=svm(y~.,data = PBIb[train,],kernel="linear",cost=tune.out$best.parameters,scale=FALSE)
> x=as.matrix(PBIb[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("Log1","Log2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="SVM classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY=PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

I ran the best model (cost = 0.1) and plotted the results. We can see that while the hyperplane does not perfectly split the data, which would be impossible, it does seems to split it at the ideal location. Red represents HighRisk and green represents LowRisk.

> table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))

          prediction
truth      HighRisk LowRisk
  HighRisk      333     181
  LowRisk       112     416

> confusion.matrix=table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.28119

We can now use the predict function with best model to predict HighRisk and LowRisk days. Our test data will be the 1042 rows that were excluded from the training dataset (coded as -train). When viewed in the table format we can see the number of correct and incorrect classifications. The error rate was 0.28119

Naive Bayes to categorize high risk or low risk days for Pitney Bowes (PBI)

> X.Train = PBIb[train,1:2]
> Y.Train = PBIb[train,3]
> X.Test = PBIb[-train,1:2]
> Y.Test = PBIb[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+      main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")

I ran the Naive Bayes model with cross validation (using the training data). I then plotted the xgrid predictions to view the classification boundary. We can see that it is not linear, as it was with SVM.

> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY = PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)

Here we can see how the HighRisk and LowRisk points compare to the boundary.

> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)

          Prediction
truth      HighRisk LowRisk
  HighRisk      327     187
  LowRisk       109     419

> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.2840691

With the Naive Bayes classification boundary the error rate was 0.2840691

Logistic Regression to categorize high risk or low risk days for Pitney Bowes (PBI)

> glm.fit = glm(y ~ Log1 + Log2, data=PBIb[train,], family=binomial)
> glm.probs = predict(glm.fit,PBIb[-train,],type="response")
> glm.forecast = PBIb[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(PBIb[-train,1],PBIb[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+      main="Logistic Regression Forecast",ylab="Log of Lag2",xlab="Log of Lag1", pch=20,cex=1)

I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.

> table(truth=PBIb[-train,3],prediction=glm.forecast)

          prediction
truth      HighRisk LowRisk
  HighRisk      344     170
  LowRisk       125     403

> confusion.matrix=table(truth=PBIb[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.2831094

Logistic regression had an error rate of 0.2831094

Part 2: County Health Data

Introduction to County Health Dataset

I have chosen to study health data from “The County Health Rankings & Roadmaps” program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. The data can be found at the following web address:

County Health Data

Data Elements	Code	Description
FIPS	FIPS	Federal Information Processing Standard
State	State	State
County	County	County
Years of Potential Life Lost Rate	YPLL	Age-adjusted YPLL rate per 100,000
% Fair/Poor	PFP	Percentage of adults that report fair or poor health
Physically Unhealthy Days	PUD	Average number of reported physically unhealthy days per month
Mentally Unhealthy Days	MUD	Average number of reported mentally unhealthy days per month
% LBW	LBW	Percentage of births with low birth weight (<2500g)
% Smokers	Smoke	Percentage of adults that reported currently smoking
% Obese	Obese	Percentage of adults that report BMI >= 30
Food Environment Index	FEI	Indicator of access to healthy foods - 0 is worst, 10 is best
% Physically Inactive	PI	Percentage of adults that report no leisure-time physical activity
% With Access	WA	Percentage of the population with access to places for physical activity
% Excessive Drinking	ED	Percentage of adults that report excessive drinking
Impaired	AI	Percentage of driving deaths with alcohol involvement
Chlamydia Rate	CR	Chlamydia cases per 100,000 population
Teen Birth Rate	TB	Births per 1,000 females ages 15-19
% Uninsured	Unis	Percentage of people under age 65 without insurance
PCP Rate	PCP	Primary Care Physicians per 100,000 population
Dentist Rate	Dent	Dentists per 100,000 population
MHP Rate	MHP	Mental Health Providers per 100,000 population
Preventable Hosp. Rate	PHR	Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Screened	PS	Percentage of female Medicare enrollees having an annual mammogram (age 65-74)
% Vaccinated	PV	Percentage of annual Medicare enrollees having an annual flu vaccination
Graduation Rate	GR	Graduation rate
% Some College	SC	Percentage of adults age 25-44 with some post-secondary education
% Unemployed	Unem	Percentage of population ages 16+ unemployed and looking for work
% Children in Poverty	CIP	Percentage of children (under age 18) living in poverty
Income Ratio	IR	Ratio of household income at the 80th percentile to income at the 20th percentile
% Single-Parent Households	SPH	Percentage of children that live in single-parent households
Association Rate	AR	Associations per 10,000 population
Violent Crime Rate	VCR	Violent crimes per 100,000 population
Injury Death Rate	IDR	Injury mortality rate per 100,000
Average Daily PM	ADPM	Average daily amount of fine particulate matter in micrograms per cubic meter
Presence of violation	POV	County affected by a water violation: 1-Yes, 0-No
% Severe Housing Problems	SHP	Percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities
% Drive Alone	PDA	Percentage of workers who drive alone to work
% Long Commute - Drives Alone	PLC	Among workers who commute in their car alone, the percentage that commute more than 30 minutes

The dataset has 38 columns and 3,142 rows.

I intend to focus on Years of Potential Life Lost Rate (YPLL) as my dependent variable. According to Wikipedia

“Years of potential life lost (YPLL) or potential years of life lost (PYLL), is an estimate of the average years a person would have lived if he or she had not died prematurely.[1] It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people. An alternative is to consider the effects of both disability and premature death using disability adjusted life years.”

It is calculated by subtracting the person’s age at death from a reference age, like 75. A death at 60 would result in 15 years of potential lost life. Negative numbers are not included, so a death at 80 is calculated as 0.

I will categorize a value above the column median as High Risk and a value below the median as Low Risk, to signify the relative risk of dying early for a person in that county.

> CHD <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/CHD.csv")

FIPS	State	County	YPLL	PFP	PUD	MUD	LBW	Smoke	Obese	FEI	PI	WA
1001	Alabama	Autauga	8824	18	4.2	4.3	8	19	38	7.2	31	69

ED	AI	CR	TB	Unis	PCP	Dent	MHP	PHR	PS	PV	GR	SC
17	29	341.2	27	9	42	32	16	6599	44	41	90	61

Unem	CIP	IR	SPH	AR	VCR	IDR	ADPM	POV	SHP	PDA	PLC
3.9	19	4.6	25	12.6	272	74	11.7	No	15	86	38

> pairs(CHD[,4:12])

> pairs(CHD[,c(4,13:20)])

> pairs(CHD[,c(4,21:28)])

> pairs(CHD[,c(4,29:34,36)])

We can see from the scatterplots that YPLL tends to be correlated with healthiness measures like smoking, obesity, injuries, etc.

SVM to Categorize High Risk or Low Risk YPLL for US Counties

I began the analysis by starting with two x-variables, Smoking and Obesity. I logged both and added HighRisk/LowRisk as a factor.

> CHD2 = CHD[,c(4,9,10)]
> CHD2 <- na.omit(CHD2)
> a = median(CHD2$YPLL)
> CHD2$Risk = ifelse(CHD2$YPLL<=a,"LowRisk","HighRisk")
> CHD2$LogSmoke = log(CHD2$Smoke)
> CHD2$LogObese = log(CHD2$Obese)
> y=as.factor(CHD2$Risk)
> CHD2 = data.frame(CHD2[,5:6],y)
> kable(CHD2[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive",full_width=F, position = "left")

LogSmoke	LogObese	y
2.94	3.64	HighRisk
2.83	3.43	LowRisk
3.09	3.78	HighRisk
3.00	3.64	HighRisk
3.00	3.53	HighRisk
3.14	3.66	HighRisk

> set.seed(12345)
> train = sample(2908,1500)
> tune.out=tune(svm,y~.,data=CHD2[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
  0.1

- best performance: 0.2466667 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.2560000 0.02458545
2 1e-02 0.2493333 0.02270585
3 1e-01 0.2466667 0.02266231
4 1e+00 0.2473333 0.02382032
5 5e+00 0.2473333 0.02382032
6 1e+01 0.2473333 0.02382032
7 1e+02 0.2473333 0.02382032
8 1e+03 0.2473333 0.02382032

I then took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.1.

> svmfit=svm(y~.,data=CHD2[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)


Call:
svm(formula = y ~ ., data = CHD2[train, ], kernel = "linear", cost = tune.out$best.parameters[1, 
    1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.1 

Number of Support Vectors:  1078

 ( 539 539 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 1078 support vectors, with 539 from one class and 539 from the other.

> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="SVM classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY=CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)

We can see from the plot that the hyperplane has split the data at the ideal location, with most of the High Risk datapoints clustered in the upper right. It’s no surprise that a higher percent of smoking and obesity leads to higher years of lost life.

> table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))

          prediction
truth      HighRisk LowRisk
  HighRisk      510     198
  LowRisk       156     544

> confusion.matrix=table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.2514205

I then used the predict function with the remaining rows of data (the test data, coded as -train) and used the table function to evaluate the outcome. The error rate was 0.2514205

Next, I expanded the analysis to 10 x-variables, as shown below.

Data Elements	Code	Description
% Fair/Poor	PFP	Percentage of adults that report fair or poor health
Physically Unhealthy Days	PUD	Average number of reported physically unhealthy days per month
Mentally Unhealthy Days	MUD	Average number of reported mentally unhealthy days per month
% Smokers	Smoke	Percentage of adults that reported currently smoking
% Obese	Obese	Percentage of adults that report BMI >= 30
% Physically Inactive	PI	Percentage of adults that report no leisure-time physical activity
Chlamydia Rate	CR	Chlamydia cases per 100,000 population
Preventable Hosp. Rate	PHR	Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees
% Children in Poverty	CIP	Percentage of children (under age 18) living in poverty
Injury Death Rate	IDR	Injury mortality rate per 100,000

> CHD3 = CHD[,c(4,9,10,5,6,7,12,16,33,22,28)]
> CHD3 <- na.omit(CHD3)
> CHD3[,2:11] <-log(CHD3[,2:11])
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

YPLL	Smoke	Obese	PFP	PUD	MUD	PI	CR	IDR	PHR	CIP
8824	2.94	3.64	2.89	1.44	1.46	3.43	5.83	4.30	8.79	2.94
7225	2.83	3.43	2.89	1.41	1.44	3.18	5.83	4.23	8.25	2.71

> a=median(CHD3$YPLL)
> CHD3$Risk = ifelse(CHD3$YPLL<=a,"LowRisk","HighRisk")
> y=as.factor(CHD3$Risk)
> CHD3 = data.frame(CHD3[,2:11],y)
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

Smoke	Obese	PFP	PUD	MUD	PI	CR	IDR	PHR	CIP	y
2.94	3.64	2.89	1.44	1.46	3.43	5.83	4.30	8.79	2.94	HighRisk
2.83	3.43	2.89	1.41	1.44	3.18	5.83	4.23	8.25	2.71	LowRisk

> set.seed(12345)
> train = sample(2875,1500)
> tune.out=tune(svm,y~.,data=CHD3[train,],kernel="linear",
+               ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)


Parameter tuning of 'svm':

- sampling method: 10-fold cross validation 

- best parameters:
 cost
 1000

- best performance: 0.1093333 

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.1346667 0.02283597
2 1e-02 0.1113333 0.02898275
3 1e-01 0.1133333 0.03370167
4 1e+00 0.1100000 0.03488960
5 5e+00 0.1100000 0.03488960
6 1e+01 0.1100000 0.03488960
7 1e+02 0.1100000 0.03517154
8 1e+03 0.1093333 0.03431157

I took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 1000.

> svmfit=svm(y~.,data=CHD3[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)


Call:
svm(formula = y ~ ., data = CHD3[train, ], kernel = "linear", cost = tune.out$best.parameters[1, 
    1], scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1000 

Number of Support Vectors:  413

 ( 206 207 )


Number of Classes:  2 

Levels: 
 HighRisk LowRisk

I ran the model with the optimal tuning parameter and we can see that there are 413 support vectors, with 206 from one class and 207 from the other.

> table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))

          prediction
truth      HighRisk LowRisk
  HighRisk      591      88
  LowRisk        80     616

> confusion.matrix=table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.1221818

I then used the predict function with the remaining rows of data and used the table function to evaluate the outcome. The error rate was an impressive 0.1221818.

Naive Bayes to Categorize High Risk or Low Risk YPLL for US Counties

> set.seed(1234)
> train = sample(2908,1500)
> X.Train = CHD2[train,1:2]
> Y.Train = CHD2[train,3]
> X.Test = CHD2[-train,1:2]
> Y.Test = CHD2[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+      main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")

Here we can see the non-linear classification boundary for the two logged variables, smoking and obesity.

> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+      main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY = CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)

When the points are plotted along with the boundary we can see how well it performs

> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)

          Prediction
truth      HighRisk LowRisk
  HighRisk      529     172
  LowRisk       197     510

> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.2620739

With the Naive Bayes classification boundary the error rate was 0.2620739.

> X.Train = CHD3[train,1:10]
> Y.Train = CHD3[train,11]
> X.Test = CHD3[-train,1:10]
> Y.Test = CHD3[-train,11]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)

          Prediction
truth      HighRisk LowRisk
  HighRisk      595     113
  LowRisk        91     590

> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.1468683

I then used cross validation with the 10 x-variables, and the best model produced an error rate of 0.1468683

Logistic Regression to Categorize High Risk or Low Risk YPLL for US Counties

> glm.fit = glm(y ~ LogSmoke + LogObese, data=CHD2[train,], family=binomial)
> glm.probs = predict(glm.fit,CHD2[-train,],type="response")
> glm.forecast = CHD2[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(CHD2[-train,1],CHD2[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+      main="Logistic Regression Forecast",ylab="Log of Obese",xlab="Log of Smoke", pch=20,cex=1)

I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.

> table(truth=CHD2[-train,3],prediction=glm.forecast)

          prediction
truth      HighRisk LowRisk
  HighRisk      531     170
  LowRisk       200     507

> confusion.matrix=table(truth=CHD2[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.2627841

With the Logistic Regression classification boundary the error rate was 0.2627841

> CHD4 <- CHD3
> CHD4$y = ifelse(CHD4$y=="LowRisk",1,0)
> kable(CHD4[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

Smoke	Obese	PFP	PUD	MUD	PI	CR	IDR	PHR	CIP	y
2.94	3.64	2.89	1.44	1.46	3.43	5.83	4.30	8.79	2.94	0
2.83	3.43	2.89	1.41	1.44	3.18	5.83	4.23	8.25	2.71	1
3.09	3.78	3.26	1.63	1.53	3.33	6.32	4.29	8.46	3.91	0
3.00	3.64	3.00	1.48	1.46	3.56	5.71	4.61	8.70	3.30	0
3.00	3.53	3.04	1.50	1.55	3.37	4.74	4.65	8.33	2.94	0
3.14	3.66	3.37	1.65	1.57	3.37	6.60	4.70	8.82	3.87	0

> CV2.out=bestglm(CHD4, IC="BICq",family=binomial, TopModels=1)
> CV2.out

BICq(q = 0.25)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11

> CV3.out=bestglm(CHD4, IC="CV",family=binomial, TopModels=1)
> CV3.out

CVd(d = 2463, REP = 1000)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) 81.5408700  3.5918084  22.701899 4.290586e-114
Smoke       -2.3221935  0.5268621  -4.407592  1.045260e-05
PFP         -1.9876339  0.5331587  -3.728034  1.929792e-04
PI          -4.0325770  0.4899790  -8.230101  1.870551e-16
CR          -0.9008511  0.1317616  -6.836978  8.088093e-12
IDR         -7.9599119  0.4055681 -19.626571  9.169736e-86
PHR         -1.1969635  0.2273456  -5.264951  1.402271e-07
CIP         -1.9532981  0.3010998  -6.487212  8.743898e-11

I expanded the analysis to 10 x-variables and used both Cross Validation and BIC to find the best model. Both produced the same results.

> glm.fit = glm(y ~ Smoke + PFP + PI + CR + IDR + PHR + CIP, data=CHD4, family=binomial)
> glm.probs = predict(glm.fit,type="response")
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> table(truth=CHD4[,11],prediction=glm.forecast)

     prediction
truth HighRisk LowRisk
    0     1266     171
    1      165    1273

> confusion.matrix=table(truth=CHD4[,11],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)

[1] 0.1168696

The ideal model produced an error rate of 0.1168696

Results Summary

PBI Risk

Model	Error Rate
SVM	0.28119
Naive Bayes	0.2840691
Logistic Regression	0.2831094

County Health Risk - 2 variables

Model	Error Rate
SVM	0.2514205
Naive Bayes	0.2620739
Logistic Regression	0.2627841

County Health Risk - 10 variables

Model	Error Rate
SVM	0.1221818
Naive Bayes	0.1468683
Logistic Regression	0.1168696

SVM, Naive Bayes, and Logistic Regression - Updated

Classifaction Using Stock and County Health Data

Paul Jozefek

2021-01-13