Support Vector classification attempts to construct a hyperplane that will correctly segregate the data. In two dimensions the hyperplane is defined by:
\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]
and in p-dimensions by:
\[\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p =0\]
If the left-hand side of the equation is greater than 0 then X lies on one side of the hyperplane, and if it is less than 0 then X lies on the other side.
For linear classification the model finds the maximal margin hyperplane. The margin is the smallest distance from the training observations to the separating hyperplane, and the model chooses the hyperplane that maximizes this distance, as formalized below.
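Formally, with the two classes coded as \(y_i \in \{-1,1\}\) (as in the simulation below), the maximal margin classifier solves the optimization problem
\[\max_{\beta_0,\beta_1,...,\beta_p,\,M} M \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 = 1 \quad \text{and} \quad y_i(\beta_0 + \beta_1x_{i1}+...+\beta_px_{ip}) \ge M \text{ for all } i\]
The constraints guarantee that every observation lies on the correct side of the hyperplane at a distance of at least \(M\). The resulting hyperplane and margin can be visualized in the graphic below.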
The two classes are shown in red and green. The maximal margin hyperplane is shown as a solid line and the margin is the distance from the solid line to either of the dashed lines. The points that lie on the dashed line are support vectors. The red and green grid represents the decision rule made by the classifier.
We can see that there is one green point on the red side (misclassified). This was allowed by the model because I used a tuning parameter, C, which sets a budget for how much the observations are allowed to violate the margin. As C increases the margin widens and we become more tolerant of violations. If C is large then the classifier will have low variance and high bias, but if C is small then the classifier will have high variance and low bias.
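In this budget formulation (ISLR's notation), slack variables \(\epsilon_i\) measure how far each point violates the margin, and \(C\) caps the total violation:
\[\max_{\beta_0,...,\beta_p,\,\epsilon_1,...,\epsilon_n,\,M} M \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 = 1, \quad y_i(\beta_0 + \beta_1x_{i1}+...+\beta_px_{ip}) \ge M(1-\epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_{i=1}^{n}\epsilon_i \le C\]
Note that the cost argument of svm() in e1071 is the penalty on margin violations in the equivalent Lagrangian form, so it moves in the opposite direction from this C: a large cost tolerates few violations (narrow margin), while a small cost tolerates many (wide margin).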
library(e1071)
library(class)
library(caret)
library(klaR)
library(bestglm)
set.seed(1)
x=matrix(rnorm(20*2),ncol = 2)
y=c(rep(-1,10),rep(1,10))
x[y==1,]=x[y==1,]+1
dat=data.frame(x=x,y=as.factor(y))
svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)
make.grid = function(x, n=75) {
grange = apply(x,2,range)
x1 = seq(from = grange[1,1], to=grange[2,1],length=n)
x2 = seq(from = grange[1,2], to=grange[2,2], length=n)
expand.grid(X1 = x1, X2 = x2)
}
This function requires a matrix as the x input. It takes the range of each column and generates a sequence of 75 values from the minimum to the maximum. Finally, expand.grid generates every combination of the x1 and x2 sequence values.
xgrid = make.grid(x)
colnames(xgrid) <- c("x.1","x.2")
ygrid = predict(svmfit, xgrid)
plot(xgrid, col=c("red","green")[as.numeric(ygrid)],pch = ".", cex=1, main="Sample Hyperplane")
palette(c("red","green","green"))
points(x,col=(2+y),pch=19)
I made a grid with the x matrix and used the predict function to generate a prediction for each point. I then plotted the grid with the x values as points.
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
Unfortunately the svm function does not report the beta coefficients directly, but the formulas above recover them from the support vectors and their fitted coefficients. We can then use the hyperplane equation
\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]
to solve for the slope and intercept of the decision boundary (plotted with abline).
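Concretely, solving the hyperplane equation for \(X_2\) gives
\[X_2 = -\frac{\beta_0}{\beta_2} - \frac{\beta_1}{\beta_2}X_1\]
so the slope is \(-\beta_1/\beta_2\) and the intercept is \(-\beta_0/\beta_2\). One wrinkle: svm() stores \(\rho = -\beta_0\) (its decision function is \(\beta_1X_1 + \beta_2X_2 - \rho\)), which is why the code sets beta0 = svmfit$rho and uses a positive beta0 / beta[2] intercept. The (beta0 - 1) and (beta0 + 1) lines trace the two margin edges (the dashed lines).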
PBI <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/PBI.csv")
library(knitr)
library(kableExtra)
kable(PBI[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
| Date | Open | High | Low | Close | Adj.Close | Volume | PBIret |
|---|---|---|---|---|---|---|---|
| 1/3/2007 | 46.40 | 46.49 | 45.87 | 46.37 | 23.43 | 964700 | NA |
| 1/4/2007 | 46.35 | 46.52 | 45.89 | 46.32 | 23.40 | 1068400 | -0.11 |
| 1/5/2007 | 46.34 | 46.49 | 46.19 | 46.40 | 23.44 | 728900 | 0.17 |
| 1/8/2007 | 46.35 | 47.06 | 46.25 | 46.86 | 23.67 | 1054000 | 0.99 |
| 1/9/2007 | 47.15 | 47.55 | 47.06 | 47.39 | 23.94 | 1176800 | 1.13 |
| 1/10/2007 | 47.24 | 47.95 | 46.91 | 47.70 | 24.10 | 1798700 | 0.65 |
I began by importing basic daily stock data for PBI (Pitney Bowes), from 1/3/2007 to 2/5/2019.
#Create Range and Risk columns
PBI$Range = PBI$High - PBI$Low
a = median(PBI$Range)
PBI$Risk = ifelse(PBI$Range<=a,"LowRisk","HighRisk")
#Shift Range down 1 to create Lag1
PBI$Lag1 = c(0,PBI[-nrow(PBI),9])
#Shift Lag1 down 1 to create Lag2
PBI$Lag2 = c(0,PBI[-nrow(PBI),11])
#Create Logs
PBI$Log1 = log(PBI$Lag1)
PBI$Log2 = log(PBI$Lag2)
#Separate 3 columns and delete top 2 rows
PBIa = as.data.frame(PBI[3:nrow(PBI),c(13,14,10)])
kable(PBIa[1:6,],digits=2)%>% kable_styling(bootstrap_options = "responsive")
| | Log1 | Log2 | Risk |
|---|---|---|---|
| 3 | -0.46 | -0.48 | LowRisk |
| 4 | -1.20 | -0.46 | HighRisk |
| 5 | -0.21 | -1.20 | HighRisk |
| 6 | -0.71 | -0.21 | HighRisk |
| 7 | 0.04 | -0.71 | HighRisk |
| 8 | -0.80 | 0.04 | LowRisk |
If the range of intra-day prices was at or below the median across all days, the day was labeled LowRisk; if it was above the median it was labeled HighRisk. I then created one-day and two-day lags of the range data and took the log of each.
y=as.factor(PBIa$Risk)
kable(contrasts(y),col.names = NULL)%>% kable_styling(full_width=F, position = "left")
| | |
|---|---|
| HighRisk | 0 |
| LowRisk | 1 |
As a factor, HighRisk is coded as 0 and LowRisk as 1.
palette(c("red","green"))
plot(PBIa$Log1,PBIa$Log2,col=y,main="Plot of Log Range",
ylab="Log of Lag2",xlab="Log of Lag1")
We can see that LowRisk days are clustered on the lower left and HighRisk on the upper right.
PBIb = data.frame(PBIa[,1:2],y)
set.seed(1234)
train = sample(3042,2000)
The svm function requires y to be a factor, so I created a new data frame, and I took a random sample of 2,000 of the 3,042 rows for my training data.
svmfit=svm(y~.,data=PBIb[train,],kernel="linear",cost=10,scale=FALSE)
summary(svmfit)
Call:
svm(formula = y ~ ., data = PBIb[train, ], kernel = "linear",
cost = 10, scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
gamma: 0.5
Number of Support Vectors: 1239
( 620 619 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I used the svm function with the training data, a linear kernel, and a cost of 10 (the tuning parameter), and I did not scale the data. We can see from the summary that there are 1239 support vectors, 620 from one class and 619 from the other.
tune.out=tune(svm,y~.,data=PBIb[train,],kernel="linear",
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.01
- best performance: 0.2705
- Detailed performance results:
cost error dispersion
1 1e-03 0.2720 0.02898275
2 1e-02 0.2705 0.03086260
3 1e-01 0.2720 0.03409138
4 1e+00 0.2720 0.03301515
5 5e+00 0.2725 0.03285067
6 1e+01 0.2725 0.03285067
7 1e+02 0.2720 0.03301515
8 1e+03 0.2960 0.07712904
I then used the tune function, which runs 10-fold cross validation over the supplied range of tuning parameters, to find the best model. A cost of 0.01 produced the lowest cross-validation error.
svmfit=svm(y~.,data = PBIb[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
x=as.matrix(PBIb[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("Log1","Log2")
ygrid = predict(svmfit, xgrid)
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
main="SVM classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
ColorY=PBIb[train,3]
palette(c("red","green"))
points(PBIb[train,1:2],col=ColorY,pch=19)
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
I ran the best model (cost = 0.01) and plotted the results. We can see that while the hyperplane does not perfectly split the data, which would be impossible here, it does seem to split it at a sensible location. Red represents HighRisk and green represents LowRisk.
table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
prediction
truth HighRisk LowRisk
HighRisk 370 155
LowRisk 130 387
confusion.matrix=table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2735125
We can now use the predict function with the best model to predict HighRisk and LowRisk days. Our test data is the 1,042 rows that were excluded from the training set (indexed as -train). In the table we can see the numbers of correct and incorrect classifications. The error rate was 0.2735125.
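As a quick check, the error rate is just the off-diagonal count divided by the total number of test observations:
\[\frac{155+130}{370+155+130+387} = \frac{285}{1042} \approx 0.2735\]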
X.Train = PBIb[train,1:2]
Y.Train = PBIb[train,3]
X.Test = PBIb[-train,1:2]
Y.Test = PBIb[-train,3]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
ygrid = predict(model$finalModel,xgrid)$class
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
I ran the Naïve Bayes model with cross validation (using the training data). I then plotted the xgrid predictions to view the classification boundary. We can see that, unlike the SVM boundary, it is not linear.
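Naïve Bayes applies Bayes' rule under the assumption that the predictors are independent within each class:
\[P(y=k \mid \text{Log1}, \text{Log2}) \propto P(y=k)\,P(\text{Log1} \mid y=k)\,P(\text{Log2} \mid y=k)\]
Because each class-conditional density is estimated separately (klaR's NaiveBayes, which caret calls here, fits either a Gaussian or a kernel density to each predictor within each class), the implied decision boundary is generally curved rather than linear.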
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
ColorY = PBIb[train,3]
palette(c("red","green"))
points(PBIb[train,1:2],col=ColorY,pch=19)
Here we can see how the HighRisk and LowRisk points compare to the boundary.
table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
Prediction
truth HighRisk LowRisk
HighRisk 367 158
LowRisk 129 388
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2754319
With the Naïve Bayes classification boundary the error rate was 0.2754319.
glm.fit = glm(y ~ Log1 + Log2, data=PBIb[train,], family=binomial)
glm.probs = predict(glm.fit,PBIb[-train,],type="response")
glm.forecast = PBIb[-train,3]
glm.forecast[glm.probs<.5]="HighRisk"
glm.forecast[glm.probs>.5]="LowRisk"
plot(PBIb[-train,1],PBIb[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
main="Logistic Regression Forecast",ylab="Log of Lag2",xlab="Log of Lag1", pch=20,cex=1)
I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.
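The linear boundary is expected. Logistic regression models the log-odds as a linear function of the predictors,
\[\log\frac{P(y=\text{LowRisk})}{1-P(y=\text{LowRisk})} = \beta_0 + \beta_1\text{Log1} + \beta_2\text{Log2}\]
so the 0.5-probability threshold corresponds to the straight line \(\beta_0 + \beta_1\text{Log1} + \beta_2\text{Log2} = 0\).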
table(truth=PBIb[-train,3],prediction=glm.forecast)
prediction
truth HighRisk LowRisk
HighRisk 378 147
LowRisk 140 377
confusion.matrix=table(truth=PBIb[-train,3],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2754319
Logistic regression had an error rate of 0.2754319.
I have chosen to study health data from “The County Health Rankings & Roadmaps” program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. The data can be found on the County Health Rankings & Roadmaps website.
| Data Elements | Code | Description |
|---|---|---|
| FIPS | FIPS | Federal Information Processing Standard |
| State | State | State |
| County | County | County |
| Years of Potential Life Lost Rate | YPLL | Age-adjusted YPLL rate per 100,000 |
| % Fair/Poor | PFP | Percentage of adults that report fair or poor health |
| Physically Unhealthy Days | PUD | Average number of reported physically unhealthy days per month |
| Mentally Unhealthy Days | MUD | Average number of reported mentally unhealthy days per month |
| % LBW | LBW | Percentage of births with low birth weight (<2500g) |
| % Smokers | Smoke | Percentage of adults that reported currently smoking |
| % Obese | Obese | Percentage of adults that report BMI >= 30 |
| Food Environment Index | FEI | Indicator of access to healthy foods - 0 is worst, 10 is best |
| % Physically Inactive | PI | Percentage of adults that report no leisure-time physical activity |
| % With Access | WA | Percentage of the population with access to places for physical activity |
| % Excessive Drinking | ED | Percentage of adults that report excessive drinking |
| % Alcohol-Impaired | AI | Percentage of driving deaths with alcohol involvement |
| Chlamydia Rate | CR | Chlamydia cases per 100,000 population |
| Teen Birth Rate | TB | Births per 1,000 females ages 15-19 |
| % Uninsured | Unis | Percentage of people under age 65 without insurance |
| PCP Rate | PCP | Primary Care Physicians per 100,000 population |
| Dentist Rate | Dent | Dentists per 100,000 population |
| MHP Rate | MHP | Mental Health Providers per 100,000 population |
| Preventable Hosp. Rate | PHR | Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees |
| % Screened | PS | Percentage of female Medicare enrollees having an annual mammogram (age 65-74) |
| % Vaccinated | PV | Percentage of Medicare enrollees having an annual flu vaccination |
| Graduation Rate | GR | Graduation rate |
| % Some College | SC | Percentage of adults age 25-44 with some post-secondary education |
| % Unemployed | Unem | Percentage of population ages 16+ unemployed and looking for work |
| % Children in Poverty | CIP | Percentage of children (under age 18) living in poverty |
| Income Ratio | IR | Ratio of household income at the 80th percentile to income at the 20th percentile |
| % Single-Parent Households | SPH | Percentage of children that live in single-parent households |
| Association Rate | AR | Associations per 10,000 population |
| Violent Crime Rate | VCR | Violent crimes per 100,000 population |
| Injury Death Rate | IDR | Injury mortality rate per 100,000 |
| Average Daily PM | ADPM | Average daily amount of fine particulate matter in micrograms per cubic meter |
| Presence of violation | POV | County affected by a water violation: 1-Yes, 0-No |
| % Severe Housing Problems | SHP | Percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities |
| % Drive Alone | PDA | Percentage of workers who drive alone to work |
| % Long Commute - Drives Alone | PLC | Among workers who commute in their car alone, the percentage that commute more than 30 minutes |
The dataset has 38 columns and 3,142 rows.
I intend to focus on Years of Potential Life Lost Rate (YPLL) as my dependent variable. According to Wikipedia:
“Years of potential life lost (YPLL) or potential years of life lost (PYLL), is an estimate of the average years a person would have lived if he or she had not died prematurely. It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people. An alternative is to consider the effects of both disability and premature death using disability adjusted life years.”
It is calculated by subtracting the person’s age at death from a reference age, like 75. A death at 60 would result in 15 years of potential lost life. Negative numbers are not included, so a death at 80 is calculated as 0.
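A minimal R sketch of the calculation (the helper name and the reference-age default are mine, for illustration only):
#Hypothetical helper: YPLL contribution of a single death,
#assuming a reference age of 75 as in the example above
ypll.one.death = function(age.at.death, reference.age = 75) {
  max(reference.age - age.at.death, 0) #deaths past the reference age count as 0
}
ypll.one.death(60) #returns 15
ypll.one.death(80) #returns 0, negative values are not included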
I will categorize a value above the column median as High Risk and a value at or below the median as Low Risk, to signify the relative risk of dying early for a person in that county.
CHD <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/CHD.csv")
| FIPS | State | County | YPLL | PFP | PUD | MUD | LBW | Smoke | Obese | FEI | PI | WA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Alabama | Autauga | 8824 | 18 | 4.2 | 4.3 | 8 | 19 | 38 | 7.2 | 31 | 69 |
| ED | AI | CR | TB | Unis | PCP | Dent | MHP | PHR | PS | PV | GR | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 29 | 341.2 | 27 | 9 | 42 | 32 | 16 | 6599 | 44 | 41 | 90 | 61 |
| Unem | CIP | IR | SPH | AR | VCR | IDR | ADPM | POV | SHP | PDA | PLC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.9 | 19 | 4.6 | 25 | 12.6 | 272 | 74 | 11.7 | No | 15 | 86 | 38 |
pairs(CHD[,4:12])
pairs(CHD[,c(4,13:20)])
pairs(CHD[,c(4,21:28)])
pairs(CHD[,c(4,29:36)])
We can see from the scatterplots that YPLL tends to be correlated with health measures like smoking, obesity, and injury deaths.
I began the analysis with two x-variables, smoking (Smoke) and obesity (Obese). I took the log of both and added the HighRisk/LowRisk label as a factor.
CHD2 = CHD[,c(4,9,10)]
CHD2 <- na.omit(CHD2)
a = median(CHD2$YPLL)
CHD2$Risk = ifelse(CHD2$YPLL<=a,"LowRisk","HighRisk")
CHD2$LogSmoke = log(CHD2$Smoke)
CHD2$LogObese = log(CHD2$Obese)
y=as.factor(CHD2$Risk)
CHD2 = data.frame(CHD2[,5:6],y)
kable(CHD2[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive",full_width=F, position = "left")
| LogSmoke | LogObese | y |
|---|---|---|
| 2.94 | 3.64 | HighRisk |
| 2.83 | 3.43 | LowRisk |
| 3.09 | 3.78 | HighRisk |
| 3.00 | 3.64 | HighRisk |
| 3.00 | 3.53 | HighRisk |
| 3.14 | 3.66 | HighRisk |
set.seed(12345)
train = sample(2908,1500)
tune.out=tune(svm,y~.,data=CHD2[train,],kernel="linear",
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.01
- best performance: 0.2526667
- Detailed performance results:
cost error dispersion
1 1e-03 0.2600000 0.03527668
2 1e-02 0.2526667 0.02693385
3 1e-01 0.2546667 0.02606734
4 1e+00 0.2560000 0.02688797
5 5e+00 0.2560000 0.02688797
6 1e+01 0.2560000 0.02688797
7 1e+02 0.2560000 0.02688797
8 1e+03 0.2560000 0.02688797
I then took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.01.
svmfit=svm(y~.,data=CHD2[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
summary(svmfit)
Call:
svm(formula = y ~ ., data = CHD2[train, ], kernel = "linear",
cost = tune.out$best.parameters[1, 1], scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.01
gamma: 0.5
Number of Support Vectors: 1468
( 734 734 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I ran the model with the optimal tuning parameter and we can see that there are 1468 support vectors, with 734 from one class and 734 from the other.
x=as.matrix(CHD2[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("LogSmoke","LogObese")
ygrid = predict(svmfit, xgrid)
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
main="SVM classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
ColorY=CHD2[train,3]
palette(c("red","green"))
points(CHD2[train,1:2],col=ColorY,pch=19)
beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
We can see from the plot that the hyperplane splits the data about where we would expect, with most of the HighRisk points clustered in the upper right. It's no surprise that higher rates of smoking and obesity are associated with more years of potential life lost.
table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
prediction
truth HighRisk LowRisk
HighRisk 516 175
LowRisk 175 542
confusion.matrix=table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2485795
I then used the predict function with the remaining rows of data (the test data, indexed as -train) and used the table function to evaluate the outcome. The error rate was 0.2485795.
Next, I expanded the analysis to 10 x-variables, as shown below.
| Data Elements | Code | Description |
|---|---|---|
| % Fair/Poor | PFP | Percentage of adults that report fair or poor health |
| Physically Unhealthy Days | PUD | Average number of reported physically unhealthy days per month |
| Mentally Unhealthy Days | MUD | Average number of reported mentally unhealthy days per month |
| % Smokers | Smoke | Percentage of adults that reported currently smoking |
| % Obese | Obese | Percentage of adults that report BMI >= 30 |
| % Physically Inactive | PI | Percentage of adults that report no leisure-time physical activity |
| Chlamydia Rate | CR | Chlamydia cases per 100,000 population |
| Preventable Hosp. Rate | PHR | Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees |
| % Children in Poverty | CIP | Percentage of children (under age 18) living in poverty |
| Injury Death Rate | IDR | Injury mortality rate per 100,000 |
CHD3 = CHD[,c(4,9,10,5,6,7,12,16,33,22,28)]
CHD3 <- na.omit(CHD3)
CHD3[,2:11] <-log(CHD3[,2:11])
kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
| YPLL | Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP |
|---|---|---|---|---|---|---|---|---|---|---|
| 8824 | 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 |
| 7225 | 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 |
a=median(CHD3$YPLL)
CHD3$Risk = ifelse(CHD3$YPLL<=a,"LowRisk","HighRisk")
y=as.factor(CHD3$Risk)
CHD3 = data.frame(CHD3[,2:11],y)
kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
| Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 | HighRisk |
| 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 | LowRisk |
set.seed(12345)
train = sample(2875,1500)
tune.out=tune(svm,y~.,data=CHD3[train,],kernel="linear",
ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.1153333
- Detailed performance results:
cost error dispersion
1 1e-03 0.1393333 0.03332593
2 1e-02 0.1173333 0.02518083
3 1e-01 0.1153333 0.02352829
4 1e+00 0.1180000 0.02155670
5 5e+00 0.1180000 0.02352829
6 1e+01 0.1180000 0.02352829
7 1e+02 0.1180000 0.02352829
8 1e+03 0.1180000 0.02352829
I took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.1.
svmfit=svm(y~.,data=CHD3[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
summary(svmfit)
Call:
svm(formula = y ~ ., data = CHD3[train, ], kernel = "linear",
cost = tune.out$best.parameters[1, 1], scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.1
gamma: 0.1
Number of Support Vectors: 623
( 312 311 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I ran the model with the optimal tuning parameter and we can see that there are 623 support vectors, with 312 from one class and 311 from the other.
table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
prediction
truth HighRisk LowRisk
HighRisk 589 84
LowRisk 77 625
confusion.matrix=table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1170909
I then used the predict function with the remaining rows of data and used the table function to evaluate the outcome. The error rate was an impressive 0.1170909.
set.seed(1234)
train = sample(2908,1500)
X.Train = CHD2[train,1:2]
Y.Train = CHD2[train,3]
X.Test = CHD2[-train,1:2]
Y.Test = CHD2[-train,3]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
x=as.matrix(CHD2[train,1:2])
xgrid=make.grid(x)
colnames(xgrid)<- c("LogSmoke","LogObese")
ygrid = predict(model$finalModel,xgrid)$class
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
Here we can see the non-linear classification boundary for the two logged variables, smoking and obesity.
plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
ColorY = CHD2[train,3]
palette(c("red","green"))
points(CHD2[train,1:2],col=ColorY,pch=19)
When the points are plotted along with the boundary we can see how well it performs.
table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
Prediction
truth HighRisk LowRisk
HighRisk 513 166
LowRisk 181 548
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2464489
With the Naïve Bayes classification boundary the error rate was 0.2464489.
X.Train = CHD3[train,1:10]
Y.Train = CHD3[train,11]
X.Test = CHD3[-train,1:10]
Y.Test = CHD3[-train,11]
model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
Prediction
truth HighRisk LowRisk
HighRisk 568 126
LowRisk 97 600
confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1603163
I then used cross validation with the 10 x-variables, and the best model produced an error rate of 0.1603163.
glm.fit = glm(y ~ LogSmoke + LogObese, data=CHD2[train,], family=binomial)
glm.probs = predict(glm.fit,CHD2[-train,],type="response")
glm.forecast = CHD2[-train,3]
glm.forecast[glm.probs<.5]="HighRisk"
glm.forecast[glm.probs>.5]="LowRisk"
plot(CHD2[-train,1],CHD2[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
main="Logistic Regression Forecast",ylab="Log of Obese",xlab="Log of Smoke", pch=20,cex=1)
I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.
table(truth=CHD2[-train,3],prediction=glm.forecast)
prediction
truth HighRisk LowRisk
HighRisk 513 166
LowRisk 188 541
confusion.matrix=table(truth=CHD2[-train,3],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2514205
With the Logistic Regression classification boundary the error rate was 0.2514205.
CHD4 <- CHD3
CHD4$y = ifelse(CHD4$y=="LowRisk",1,0)
kable(CHD4[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")
| Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 | 0 |
| 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 | 1 |
| 3.09 | 3.78 | 3.26 | 1.63 | 1.53 | 3.33 | 6.32 | 4.29 | 8.46 | 3.91 | 0 |
| 3.00 | 3.64 | 3.00 | 1.48 | 1.46 | 3.56 | 5.71 | 4.61 | 8.70 | 3.30 | 0 |
| 3.00 | 3.53 | 3.04 | 1.50 | 1.55 | 3.37 | 4.74 | 4.65 | 8.33 | 2.94 | 0 |
| 3.14 | 3.66 | 3.37 | 1.65 | 1.57 | 3.37 | 6.60 | 4.70 | 8.82 | 3.87 | 0 |
CV2.out=bestglm(CHD4, IC="BICq",family=binomial, TopModels=1)
Morgan-Tatar search since family is non-gaussian.
CV2.out
BICq(q = 0.25)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 81.5408700 3.5918084 22.701899 4.290586e-114
Smoke -2.3221935 0.5268621 -4.407592 1.045260e-05
PFP -1.9876339 0.5331587 -3.728034 1.929792e-04
PI -4.0325770 0.4899790 -8.230101 1.870551e-16
CR -0.9008511 0.1317616 -6.836978 8.088093e-12
IDR -7.9599119 0.4055681 -19.626571 9.169736e-86
PHR -1.1969635 0.2273456 -5.264951 1.402271e-07
CIP -1.9532981 0.3010998 -6.487212 8.743898e-11
CV3.out=bestglm(CHD4, IC="CV",family=binomial, TopModels=1)
Morgan-Tatar search since family is non-gaussian.
CV3.out
CVd(d = 2463, REP = 1000)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 81.5408700 3.5918084 22.701899 4.290586e-114
Smoke -2.3221935 0.5268621 -4.407592 1.045260e-05
PFP -1.9876339 0.5331587 -3.728034 1.929792e-04
PI -4.0325770 0.4899790 -8.230101 1.870551e-16
CR -0.9008511 0.1317616 -6.836978 8.088093e-12
IDR -7.9599119 0.4055681 -19.626571 9.169736e-86
PHR -1.1969635 0.2273456 -5.264951 1.402271e-07
CIP -1.9532981 0.3010998 -6.487212 8.743898e-11
I expanded the analysis to the 10 x-variables and used both cross validation and BIC to find the best model. Both procedures selected the same seven-predictor model.
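For reference, BIC balances fit against model size,
\[BIC = -2\log\hat{L} + k\log n\]
where \(\hat{L}\) is the maximized likelihood, \(k\) the number of parameters, and \(n\) the sample size. bestglm's BICq generalizes this with a prior parameter \(q\); the note that the result is "BICq equivalent for q in (0.047, 0.902)" means every q in that interval selects this same seven-predictor model.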
glm.fit = glm(y ~ Smoke + PFP + PI + CR + IDR + PHR + CIP, data=CHD4, family=binomial)
glm.probs = predict(glm.fit,type="response")
glm.forecast = ifelse(glm.probs<.5,"HighRisk","LowRisk")
table(truth=CHD4[,11],prediction=glm.forecast)
prediction
truth HighRisk LowRisk
0 1266 171
1 165 1273
confusion.matrix=table(truth=CHD4[,11],prediction=glm.forecast)
(confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1168696
The selected model produced an error rate of 0.1168696. Note that this final model was fit and evaluated on the full dataset, so this is an in-sample error rate.
PBI Risk
| Model | Error Rate |
|---|---|
| SVM | 0.2735125 |
| Naive Bayes | 0.2754319 |
| Logistic Regression | 0.2754319 |
County Health Risk - 2 variables
| Model | Error Rate |
|---|---|
| SVM | 0.2485795 |
| Naive Bayes | 0.2464489 |
| Logistic Regression | 0.2514205 |
County Health Risk - 10 variables
| Model | Error Rate |
|---|---|
| SVM | 0.1170909 |
| Naive Bayes | 0.1603163 |
| Logistic Regression | 0.1168696 |