Formatting has been updated. See the original report here:
https://rpubs.com/pjozefek/573317
Support vector classification attempts to construct a hyperplane that correctly separates the classes. In two dimensions the hyperplane is defined by:
\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]
and in p-dimensions by:
\[\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p =0\]
If the left-hand side is greater than 0 then X lies on one side of the hyperplane, and if it is less than 0 then X lies on the other side.
For linear classification the model finds the maximal margin hyperplane. The margin is the smallest distance from the training observations to the separating hyperplane, and the model maximizes this distance. This can be visualized in the graphic below.
The two classes are shown in red and green. The maximal margin hyperplane is shown as a solid line, and the margin is the distance from the solid line to either of the dashed lines. The points that lie on the dashed lines are the support vectors. The red and green grid represents the decision rule made by the classifier.
We can see that there is one green point on the red side (misclassified). The model allowed this because I used a tuning parameter, C, which sets the budget for points that can violate the margin. As C increases the margin widens and we become more tolerant of violations. If C is large then the classifier will have low variance and high bias, but if C is small then the classifier will have high variance and low bias. (Note that the cost argument of e1071's svm function penalizes violations, so it moves inversely to this budget: a large cost produces a narrower margin.)
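Formally (following the formulation in An Introduction to Statistical Learning), the support vector classifier solves:
\[\underset{\beta_0,\dots,\beta_p,\,\epsilon_1,\dots,\epsilon_n}{\text{maximize}}\; M \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 = 1,\]
\[y_i(\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \ge M(1-\epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_{i=1}^{n}\epsilon_i \le C,\]
where the \(\epsilon_i\) are slack variables that allow individual observations to fall on the wrong side of the margin or the hyperplane.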
> library(e1071)
> library(class)
> library(caret)
> library(klaR)
> library(bestglm)
>
> set.seed(1)
> x=matrix(rnorm(20*2),ncol = 2)
> y=c(rep(-1,10),rep(1,10))
> x[y==1,]=x[y==1,]+1
> dat=data.frame(x=x,y=as.factor(y))
> svmfit=svm(y~.,data=dat,kernel="linear",cost=10,scale=FALSE)
> make.grid = function(x, n=75) {
+ grange = apply(x,2,range)
+ x1 = seq(from = grange[1,1], to=grange[2,1],length=n)
+ x2 = seq(from = grange[1,2], to=grange[2,2], length=n)
+ expand.grid(X1 = x1, X2 = x2)
+ }
This function requires a matrix as the x input. It takes the range of each column and generates a sequence of 75 values from the minimum to the maximum. Finally, expand.grid generates every combination of the x1 and x2 sequence values.
> xgrid = make.grid(x)
> colnames(xgrid) <- c("x.1","x.2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid, col=c("red","green")[as.numeric(ygrid)],pch = ".", cex=1, main="Sample Hyperplane")
> palette(c("red","green","green"))
> points(x,col=(2+y),pch=19)
I made a grid from the x matrix and used the predict function to classify each grid point. I then plotted the grid, with the original x values overlaid as points.
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
Unfortunately, it is a bit difficult to extract the beta coefficients from the svm function, but the formulas above retrieve them. We can then use the hyperplane equation
\[\beta_0 + \beta_1X_1 + \beta_2X_2 =0\]
to solve for the slope and intercept of the decision boundary (plotted with abline).
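Setting the equation to zero and solving for \(X_2\) shows where the abline arguments come from. In e1071, the fitted linear decision function is \(f(X) = \beta_1X_1 + \beta_2X_2 - \rho\), where \(\rho\) is returned as svmfit$rho (beta0 in the code above):
\[f(X) = 0 \implies X_2 = \frac{\rho}{\beta_2} - \frac{\beta_1}{\beta_2}X_1\]
so the boundary's intercept is \(\rho/\beta_2\) and its slope is \(-\beta_1/\beta_2\). Setting \(f(X) = \pm 1\) instead gives the two margin lines drawn with lty = 2.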
> PBI <- read.csv("~/Baruch/Baruch Class Projects/9890 Data Mining/PBI.csv")
> library(knitr)
> library(kableExtra)
> kable(PBI[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

| Date | Open | High | Low | Close | Adj.Close | Volume | PBIret |
|---|---|---|---|---|---|---|---|
| 1/3/2007 | 46.40 | 46.49 | 45.87 | 46.37 | 23.43 | 964700 | NA |
| 1/4/2007 | 46.35 | 46.52 | 45.89 | 46.32 | 23.40 | 1068400 | -0.11 |
| 1/5/2007 | 46.34 | 46.49 | 46.19 | 46.40 | 23.44 | 728900 | 0.17 |
| 1/8/2007 | 46.35 | 47.06 | 46.25 | 46.86 | 23.67 | 1054000 | 0.99 |
| 1/9/2007 | 47.15 | 47.55 | 47.06 | 47.39 | 23.94 | 1176800 | 1.13 |
| 1/10/2007 | 47.24 | 47.95 | 46.91 | 47.70 | 24.10 | 1798700 | 0.65 |
I began by importing some basic stock data for PBI, from 1/3/07 to 2/5/19.
> #Create Range and Risk columns
> PBI$Range = PBI$High - PBI$Low
> a = median(PBI$Range)
> PBI$Risk = ifelse(PBI$Range<=a,"LowRisk","HighRisk")
>
> #Shift Range down 1 to create Lag1
> PBI$Lag1 = c(0,PBI[-nrow(PBI),9])
> #Shift Lag1 down 1 to create Lag2
> PBI$Lag2 = c(0,PBI[-nrow(PBI),11])
>
> #Create Logs
> PBI$Log1 = log(PBI$Lag1)
> PBI$Log2 = log(PBI$Lag2)
>
> #Separate 3 columns and delete top 2 rows
> PBIa = as.data.frame(PBI[3:nrow(PBI),c(13,14,10)])
> kable(PBIa[1:6,],digits=2)%>% kable_styling(bootstrap_options = "responsive")

|   | Log1 | Log2 | Risk |
|---|---|---|---|
| 3 | -0.46 | -0.48 | LowRisk |
| 4 | -1.20 | -0.46 | HighRisk |
| 5 | -0.21 | -1.20 | HighRisk |
| 6 | -0.71 | -0.21 | HighRisk |
| 7 | 0.04 | -0.71 | HighRisk |
| 8 | -0.80 | 0.04 | LowRisk |
If a day's range of intra-day prices was at or below the median across all days it was labeled LowRisk, and if it was above the median it was labeled HighRisk. I then created one-day and two-day lags of the range data and took the log of each.
> y=as.factor(PBIa$Risk)
> kable(contrasts(y),col.names = NULL)%>% kable_styling(full_width=F, position = "left")

| HighRisk | 0 |
|---|---|
| LowRisk | 1 |
As a factor, HighRisk is coded as 0 and LowRisk as 1.
> palette(c("red","green"))
> plot(PBIa$Log1,PBIa$Log2,col=y,main="Plot of Log Range",
+ ylab="Log of Lag2",xlab="Log of Lag1")We can see that LowRisk days are clustered on the lower left and HighRisk on the upper right.
I needed to create a new data frame with y as a factor (as required by the svm function), and I took a random sample of 2,000 of the 3,042 rows for my training data.
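The code that builds PBIb and the training sample was not shown in the original output; a minimal sketch consistent with the model summary below (the seed value is an assumption) is:
> PBIb = data.frame(PBIa[,1:2], y)  #Log1, Log2, and the Risk factor y
> set.seed(1)                       #assumed; the original seed was not shown
> train = sample(3042, 2000)        #2000 training rows out of 3042
> svmfit = svm(y~., data=PBIb[train,], kernel="linear", cost=10, scale=FALSE)
> summary(svmfit)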
Call:
svm(formula = y ~ ., data = PBIb[train, ], kernel = "linear", cost = 10,
scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 10
Number of Support Vectors: 1208
( 604 604 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I used the svm function with the training data, a linear kernel, a cost of 10 (the tuning parameter), and I did not scale the data. We can see from the summary that there are 1208 support vectors, with 604 from each class.
> tune.out=tune(svm,y~.,data=PBIb[train,],kernel="linear",
+ ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.262
- Detailed performance results:
cost error dispersion
1 1e-03 0.2660 0.03526093
2 1e-02 0.2655 0.03475709
3 1e-01 0.2620 0.03224903
4 1e+00 0.2630 0.03351617
5 5e+00 0.2630 0.03417276
6 1e+01 0.2625 0.03417683
7 1e+02 0.2625 0.03417683
8 1e+03 0.2645 0.03459688
I then used the tune function to find the best model; it performs 10-fold cross-validation over a range of cost values. It found that a cost of 0.1 produced the best model.
> svmfit=svm(y~.,data = PBIb[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> x=as.matrix(PBIb[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("Log1","Log2")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+ main="SVM classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY=PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
I ran the best model (cost = 0.1) and plotted the results. We can see that while the hyperplane does not perfectly split the data, which would be impossible here, it does seem to split it at the ideal location. Red represents HighRisk and green represents LowRisk.
> table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      333     181
  LowRisk       112     416
> confusion.matrix=table(truth = PBIb[-train,3],prediction = predict(tune.out$best.model,newdata=PBIb[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.28119
We can now use the predict function with the best model to forecast HighRisk and LowRisk days. Our test data is the 1,042 rows that were excluded from the training dataset (indexed as -train). Viewed as a table, we can see the number of correct and incorrect classifications. The error rate was 0.28119.
> X.Train = PBIb[train,1:2]
> Y.Train = PBIb[train,3]
> X.Test = PBIb[-train,1:2]
> Y.Test = PBIb[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+ main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")I ran the Naive Bayes model with cross validation (using the training data). I then plotted the xgrid predictions to view the classification boundary. We can see that it is not linear, as it was with SVM.
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+ main="NB classification of PBI risk",ylab="Log of Lag2",xlab="Log of Lag1")
> ColorY = PBIb[train,3]
> palette(c("red","green"))
> points(PBIb[train,1:2],col=ColorY,pch=19)
Here we can see how the HighRisk and LowRisk points compare to the boundary.
> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      327     187
  LowRisk       109     419
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2840691
With the Naive Bayes classification boundary, the error rate was 0.2840691.
> glm.fit = glm(y ~ Log1 + Log2, data=PBIb[train,], family=binomial)
> glm.probs = predict(glm.fit,PBIb[-train,],type="response")
> glm.forecast = PBIb[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(PBIb[-train,1],PBIb[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+ main="Logistic Regression Forecast",ylab="Log of Lag2",xlab="Log of Lag1", pch=20,cex=1)I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.
> table(truth=PBIb[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      344     170
  LowRisk       125     403
> confusion.matrix=table(truth=PBIb[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2831094
Logistic regression had an error rate of 0.2831094.
I have chosen to study health data from “The County Health Rankings & Roadmaps” program, which is a collaboration between the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute. The data can be found on the program's website. The variables used are described below:
| Data Elements | Code | Description |
|---|---|---|
| FIPS | FIPS | Federal Information Processing Standard |
| State | State | State |
| County | County | County |
| Years of Potential Life Lost Rate | YPLL | Age-adjusted YPLL rate per 100,000 |
| % Fair/Poor | PFP | Percentage of adults that report fair or poor health |
| Physically Unhealthy Days | PUD | Average number of reported physically unhealthy days per month |
| Mentally Unhealthy Days | MUD | Average number of reported mentally unhealthy days per month |
| % LBW | LBW | Percentage of births with low birth weight (<2500g) |
| % Smokers | Smoke | Percentage of adults that reported currently smoking |
| % Obese | Obese | Percentage of adults that report BMI >= 30 |
| Food Environment Index | FEI | Indicator of access to healthy foods - 0 is worst, 10 is best |
| % Physically Inactive | PI | Percentage of adults that report no leisure-time physical activity |
| % With Access | WA | Percentage of the population with access to places for physical activity |
| % Excessive Drinking | ED | Percentage of adults that report excessive drinking |
| % Alcohol-Impaired | AI | Percentage of driving deaths with alcohol involvement |
| Chlamydia Rate | CR | Chlamydia cases per 100,000 population |
| Teen Birth Rate | TB | Births per 1,000 females ages 15-19 |
| % Uninsured | Unis | Percentage of people under age 65 without insurance |
| PCP Rate | PCP | Primary Care Physicians per 100,000 population |
| Dentist Rate | Dent | Dentists per 100,000 population |
| MHP Rate | MHP | Mental Health Providers per 100,000 population |
| Preventable Hosp. Rate | PHR | Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees |
| % Screened | PS | Percentage of female Medicare enrollees having an annual mammogram (age 65-74) |
| % Vaccinated | PV | Percentage of Medicare enrollees having an annual flu vaccination |
| Graduation Rate | GR | Graduation rate |
| % Some College | SC | Percentage of adults age 25-44 with some post-secondary education |
| % Unemployed | Unem | Percentage of population ages 16+ unemployed and looking for work |
| % Children in Poverty | CIP | Percentage of children (under age 18) living in poverty |
| Income Ratio | IR | Ratio of household income at the 80th percentile to income at the 20th percentile |
| % Single-Parent Households | SPH | Percentage of children that live in single-parent households |
| Association Rate | AR | Associations per 10,000 population |
| Violent Crime Rate | VCR | Violent crimes per 100,000 population |
| Injury Death Rate | IDR | Injury mortality rate per 100,000 |
| Average Daily PM | ADPM | Average daily amount of fine particulate matter in micrograms per cubic meter |
| Presence of violation | POV | County affected by a water violation: 1-Yes, 0-No |
| % Severe Housing Problems | SHP | Percentage of households with at least 1 of 4 housing problems: overcrowding, high housing costs, or lack of kitchen or plumbing facilities |
| % Drive Alone | PDA | Percentage of workers who drive alone to work |
| % Long Commute - Drives Alone | PLC | Among workers who commute in their car alone, the percentage that commute more than 30 minutes |
The dataset has 38 columns and 3,142 rows.
I intend to focus on Years of Potential Life Lost Rate (YPLL) as my dependent variable. According to Wikipedia
“Years of potential life lost (YPLL) or potential years of life lost (PYLL), is an estimate of the average years a person would have lived if he or she had not died prematurely.[1] It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people. An alternative is to consider the effects of both disability and premature death using disability adjusted life years.”
It is calculated by subtracting the person’s age at death from a reference age, like 75. A death at 60 would result in 15 years of potential lost life. Negative numbers are not included, so a death at 80 is calculated as 0.
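As a quick illustration (ypll is a hypothetical helper, not part of the report):
> ypll = function(age_at_death, ref = 75) pmax(ref - age_at_death, 0)
> ypll(c(60, 80))
[1] 15  0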
I will categorize a value above the column median as High Risk and a value below the median as Low Risk, to signify the relative risk of dying early for a person in that county.
| FIPS | State | County | YPLL | PFP | PUD | MUD | LBW | Smoke | Obese | FEI | PI | WA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Alabama | Autauga | 8824 | 18 | 4.2 | 4.3 | 8 | 19 | 38 | 7.2 | 31 | 69 |
| ED | AI | CR | TB | Unis | PCP | Dent | MHP | PHR | PS | PV | GR | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 29 | 341.2 | 27 | 9 | 42 | 32 | 16 | 6599 | 44 | 41 | 90 | 61 |
| Unem | CIP | IR | SPH | AR | VCR | IDR | ADPM | POV | SHP | PDA | PLC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.9 | 19 | 4.6 | 25 | 12.6 | 272 | 74 | 11.7 | No | 15 | 86 | 38 |
We can see from the scatterplots that YPLL tends to be correlated with healthiness measures like smoking, obesity, injuries, etc.
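The scatterplots themselves are not reproduced here; a matrix like them can be generated with something along these lines (the column selection is illustrative):
> pairs(CHD[,c("YPLL","Smoke","Obese","PI","IDR")], main="YPLL vs. selected health measures")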
I began the analysis by starting with two x-variables, Smoking and Obesity. I logged both and added HighRisk/LowRisk as a factor.
> CHD2 = CHD[,c(4,9,10)]
> CHD2 <- na.omit(CHD2)
> a = median(CHD2$YPLL)
> CHD2$Risk = ifelse(CHD2$YPLL<=a,"LowRisk","HighRisk")
> CHD2$LogSmoke = log(CHD2$Smoke)
> CHD2$LogObese = log(CHD2$Obese)
> y=as.factor(CHD2$Risk)
> CHD2 = data.frame(CHD2[,5:6],y)
> kable(CHD2[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive",full_width=F, position = "left")

| LogSmoke | LogObese | y |
|---|---|---|
| 2.94 | 3.64 | HighRisk |
| 2.83 | 3.43 | LowRisk |
| 3.09 | 3.78 | HighRisk |
| 3.00 | 3.64 | HighRisk |
| 3.00 | 3.53 | HighRisk |
| 3.14 | 3.66 | HighRisk |
> set.seed(12345)
> train = sample(2908,1500)
> tune.out=tune(svm,y~.,data=CHD2[train,],kernel="linear",
+ ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
0.1
- best performance: 0.2466667
- Detailed performance results:
cost error dispersion
1 1e-03 0.2560000 0.02458545
2 1e-02 0.2493333 0.02270585
3 1e-01 0.2466667 0.02266231
4 1e+00 0.2473333 0.02382032
5 5e+00 0.2473333 0.02382032
6 1e+01 0.2473333 0.02382032
7 1e+02 0.2473333 0.02382032
8 1e+03 0.2473333 0.02382032
I then took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 0.1.
> svmfit=svm(y~.,data=CHD2[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)
Call:
svm(formula = y ~ ., data = CHD2[train, ], kernel = "linear", cost = tune.out$best.parameters[1,
1], scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.1
Number of Support Vectors: 1078
( 539 539 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I ran the model with the optimal tuning parameter and we can see that there are 1078 support vectors, with 539 from one class and 539 from the other.
> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(svmfit, xgrid)
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+ main="SVM classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY=CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)
> beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
> beta0 = svmfit$rho
> abline(beta0 / beta[2], -beta[1] / beta[2])
> abline((beta0 -1) / beta[2], -beta[1] / beta[2], lty =2)
> abline((beta0 +1) / beta[2], -beta[1] / beta[2], lty =2)
We can see from the plot that the hyperplane has split the data at the ideal location, with most of the HighRisk data points clustered in the upper right. It is no surprise that higher rates of smoking and obesity are associated with more years of potential life lost.
> table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      510     198
  LowRisk       156     544
> confusion.matrix=table(truth = CHD2[-train,3],prediction = predict(tune.out$best.model,newdata=CHD2[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2514205
I then used the predict function with the remaining rows of data (the test data, indexed as -train) and used the table function to evaluate the outcome. The error rate was 0.2514205.
Next, I expanded the analysis to 10 x-variables, as shown below.
| Data Elements | Code | Description |
|---|---|---|
| % Fair/Poor | PFP | Percentage of adults that report fair or poor health |
| Physically Unhealthy Days | PUD | Average number of reported physically unhealthy days per month |
| Mentally Unhealthy Days | MUD | Average number of reported mentally unhealthy days per month |
| % Smokers | Smoke | Percentage of adults that reported currently smoking |
| % Obese | Obese | Percentage of adults that report BMI >= 30 |
| % Physically Inactive | PI | Percentage of adults that report no leisure-time physical activity |
| Chlamydia Rate | CR | Chlamydia cases per 100,000 population |
| Preventable Hosp. Rate | PHR | Discharges for Ambulatory Care Sensitive Conditions per 100,000 Medicare Enrollees |
| % Children in Poverty | CIP | Percentage of children (under age 18) living in poverty |
| Injury Death Rate | IDR | Injury mortality rate per 100,000 |
> CHD3 = CHD[,c(4,9,10,5,6,7,12,16,33,22,28)]
> CHD3 <- na.omit(CHD3)
> CHD3[,2:11] <-log(CHD3[,2:11])
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

| YPLL | Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP |
|---|---|---|---|---|---|---|---|---|---|---|
| 8824 | 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 |
| 7225 | 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 |
> a=median(CHD3$YPLL)
> CHD3$Risk = ifelse(CHD3$YPLL<=a,"LowRisk","HighRisk")
> y=as.factor(CHD3$Risk)
> CHD3 = data.frame(CHD3[,2:11],y)
> kable(CHD3[1:2,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

| Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 | HighRisk |
| 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 | LowRisk |
> set.seed(12345)
> train = sample(2875,1500)
> tune.out=tune(svm,y~.,data=CHD3[train,],kernel="linear",
+ ranges=list(cost=c(0.001,0.01,0.1,1,5,10,100,1000)))
> summary(tune.out)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
cost
1000
- best performance: 0.1093333
- Detailed performance results:
cost error dispersion
1 1e-03 0.1346667 0.02283597
2 1e-02 0.1113333 0.02898275
3 1e-01 0.1133333 0.03370167
4 1e+00 0.1100000 0.03488960
5 5e+00 0.1100000 0.03488960
6 1e+01 0.1100000 0.03488960
7 1e+02 0.1100000 0.03517154
8 1e+03 0.1093333 0.03431157
I took a sample of 1,500 and used cross validation to find the best tuning parameter, which resulted in a value of 1000.
> svmfit=svm(y~.,data=CHD3[train,],kernel="linear",cost=tune.out$best.parameters[1,1],scale=FALSE)
> summary(svmfit)
Call:
svm(formula = y ~ ., data = CHD3[train, ], kernel = "linear", cost = tune.out$best.parameters[1,
1], scale = FALSE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1000
Number of Support Vectors: 413
( 206 207 )
Number of Classes: 2
Levels:
HighRisk LowRisk
I ran the model with the optimal tuning parameter and we can see that there are 413 support vectors, with 206 from one class and 207 from the other.
> table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
          prediction
truth      HighRisk LowRisk
  HighRisk      591      88
  LowRisk        80     616
> confusion.matrix=table(truth = CHD3[-train,11],prediction = predict(tune.out$best.model,newdata=CHD3[-train,]))
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1221818
I then used the predict function with the remaining rows of data and used the table function to evaluate the outcome. The error rate was an impressive 0.1221818.
> set.seed(1234)
> train = sample(2908,1500)
> X.Train = CHD2[train,1:2]
> Y.Train = CHD2[train,3]
> X.Test = CHD2[-train,1:2]
> Y.Test = CHD2[-train,3]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> x=as.matrix(CHD2[train,1:2])
> xgrid=make.grid(x)
> colnames(xgrid)<- c("LogSmoke","LogObese")
> ygrid = predict(model$finalModel,xgrid)$class
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=20,cex=1,
+ main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")Here we can see the non-linear classification boundary for the two logged variables, smoking and obesity.
> plot(xgrid,col=c("red","green")[as.numeric(ygrid)],pch=".",cex=1,
+ main="NB classification of YPLL risk",ylab="Log of Obese",xlab="Log of Smoke")
> ColorY = CHD2[train,3]
> palette(c("red","green"))
> points(CHD2[train,1:2],col=ColorY,pch=19)
When the points are plotted along with the boundary, we can see how well it performs.
> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      529     172
  LowRisk       197     510
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2620739
With the Naive Bayes classification boundary the error rate was 0.2620739.
> X.Train = CHD3[train,1:10]
> Y.Train = CHD3[train,11]
> X.Test = CHD3[-train,1:10]
> Y.Test = CHD3[-train,11]
> model = train(X.Train,Y.Train,'nb',trControl=trainControl(method='cv',number=10))
> table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
          Prediction
truth      HighRisk LowRisk
  HighRisk      595     113
  LowRisk        91     590
> confusion.matrix = table(truth=Y.Test,Prediction=predict(model$finalModel,X.Test)$class)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1468683
I then used cross validation with the 10 x-variables, and the best model produced an error rate of 0.1468683.
> glm.fit = glm(y ~ LogSmoke + LogObese, data=CHD2[train,], family=binomial)
> glm.probs = predict(glm.fit,CHD2[-train,],type="response")
> glm.forecast = CHD2[-train,3]
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> plot(CHD2[-train,1],CHD2[-train,2],col=c("red","green")[as.numeric(glm.forecast)],
+ main="Logistic Regression Forecast",ylab="Log of Obese",xlab="Log of Smoke", pch=20,cex=1)I ran the logistic regression model with the training data and used the predict function with the test data. We can see from the plot that there is a linear prediction boundary.
> table(truth=CHD2[-train,3],prediction=glm.forecast)
          prediction
truth      HighRisk LowRisk
  HighRisk      531     170
  LowRisk       200     507
> confusion.matrix=table(truth=CHD2[-train,3],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.2627841
With the Logistic Regression classification boundary, the error rate was 0.2627841.
> CHD4 <- CHD3
> CHD4$y = ifelse(CHD4$y=="LowRisk",1,0)
> kable(CHD4[1:6,],digits=2) %>% kable_styling(bootstrap_options = "responsive")

| Smoke | Obese | PFP | PUD | MUD | PI | CR | IDR | PHR | CIP | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.94 | 3.64 | 2.89 | 1.44 | 1.46 | 3.43 | 5.83 | 4.30 | 8.79 | 2.94 | 0 |
| 2.83 | 3.43 | 2.89 | 1.41 | 1.44 | 3.18 | 5.83 | 4.23 | 8.25 | 2.71 | 1 |
| 3.09 | 3.78 | 3.26 | 1.63 | 1.53 | 3.33 | 6.32 | 4.29 | 8.46 | 3.91 | 0 |
| 3.00 | 3.64 | 3.00 | 1.48 | 1.46 | 3.56 | 5.71 | 4.61 | 8.70 | 3.30 | 0 |
| 3.00 | 3.53 | 3.04 | 1.50 | 1.55 | 3.37 | 4.74 | 4.65 | 8.33 | 2.94 | 0 |
| 3.14 | 3.66 | 3.37 | 1.65 | 1.57 | 3.37 | 6.60 | 4.70 | 8.82 | 3.87 | 0 |
BICq(q = 0.25)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 81.5408700 3.5918084 22.701899 4.290586e-114
Smoke -2.3221935 0.5268621 -4.407592 1.045260e-05
PFP -1.9876339 0.5331587 -3.728034 1.929792e-04
PI -4.0325770 0.4899790 -8.230101 1.870551e-16
CR -0.9008511 0.1317616 -6.836978 8.088093e-12
IDR -7.9599119 0.4055681 -19.626571 9.169736e-86
PHR -1.1969635 0.2273456 -5.264951 1.402271e-07
CIP -1.9532981 0.3010998 -6.487212 8.743898e-11
CVd(d = 2463, REP = 1000)
BICq equivalent for q in (0.0467792702813636, 0.902249019702038)
Best Model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 81.5408700 3.5918084 22.701899 4.290586e-114
Smoke -2.3221935 0.5268621 -4.407592 1.045260e-05
PFP -1.9876339 0.5331587 -3.728034 1.929792e-04
PI -4.0325770 0.4899790 -8.230101 1.870551e-16
CR -0.9008511 0.1317616 -6.836978 8.088093e-12
IDR -7.9599119 0.4055681 -19.626571 9.169736e-86
PHR -1.1969635 0.2273456 -5.264951 1.402271e-07
CIP -1.9532981 0.3010998 -6.487212 8.743898e-11
I expanded the analysis to 10 x-variables and used both cross validation and BICq (via the bestglm package) to find the best model. Both criteria selected the same model.
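The bestglm calls were not included in the output; a plausible reconstruction matching the headers above (the CVArgs specification is my assumption about how the package's delete-d option is passed) is:
> #bestglm expects the response in the last column, which is how CHD4 is arranged
> bestglm(CHD4, family=binomial, IC="BICq", t=0.25)  #BICq(q = 0.25)
> bestglm(CHD4, family=binomial, IC="CV",
+         CVArgs=list(Method="d", K=2463, REP=1000)) #CVd(d = 2463, REP = 1000)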
> glm.fit = glm(y ~ Smoke + PFP + PI + CR + IDR + PHR + CIP, data=CHD4, family=binomial)
> glm.probs = predict(glm.fit,type="response")
> glm.forecast = rep(NA_character_,nrow(CHD4)) #reset the forecast vector to match the full dataset
> glm.forecast[glm.probs<.5]="HighRisk"
> glm.forecast[glm.probs>.5]="LowRisk"
> table(truth=CHD4[,11],prediction=glm.forecast)
     prediction
truth HighRisk LowRisk
    0     1266     171
    1      165    1273
> confusion.matrix=table(truth=CHD4[,11],prediction=glm.forecast)
> (confusion.matrix[1,2]+confusion.matrix[2,1])/sum(confusion.matrix)
[1] 0.1168696
The selected model produced an error rate of 0.1168696 (measured on the full dataset, since no train/test split was used here).
PBI Risk
| Model | Error Rate |
|---|---|
| SVM | 0.28119 |
| Naive Bayes | 0.2840691 |
| Logistic Regression | 0.2831094 |
County Health Risk - 2 variables
| Model | Error Rate |
|---|---|
| SVM | 0.2514205 |
| Naive Bayes | 0.2620739 |
| Logistic Regression | 0.2627841 |
County Health Risk - 10 variables
| Model | Error Rate |
|---|---|
| SVM | 0.1221818 |
| Naive Bayes | 0.1468683 |
| Logistic Regression | 0.1168696 |