Introduction

We load the required libraries.

library(readr)
library(e1071)

We load the datasets.

xtrain <- read_fwf("Dataset. Cancer/14cancer.xtrain.txt")

## Rows: 16063 Columns: 144
## ── Column specification ────────────────────────────────────────────────────────
## 
## dbl (144): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ytrain <- read_fwf("Dataset. Cancer/14cancer.ytrain.txt")

## Rows: 1 Columns: 144
## ── Column specification ────────────────────────────────────────────────────────
## 
## dbl (144): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

xtest <- read_fwf("Dataset. Cancer/14cancer.xtest.txt")

## Rows: 16063 Columns: 54
## ── Column specification ────────────────────────────────────────────────────────
## 
## dbl (54): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ytest <- read_fwf("Dataset. Cancer/14cancer.ytest.txt")

## Rows: 1 Columns: 54
## ── Column specification ────────────────────────────────────────────────────────
## 
## dbl (54): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
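
The column-specification messages above can be silenced with the `show_col_types = FALSE` argument that the messages themselves suggest. The following is a minimal sketch on a small temporary file, not on the cancer data; the file contents and the names `tmp` and `toy` are assumptions made for illustration.

```r
library(readr)

# Write a tiny whitespace-aligned file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("1.5 2.5 3.5",
             "4.0 5.0 6.0"), tmp)

# Same reader as in the analysis, with the column-type message turned off.
toy <- read_fwf(tmp, show_col_types = FALSE)
dim(toy)
```

The same argument could be added to each of the four `read_fwf()` calls above.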

We transpose the data as the exercise requires, so that each row is an observation and each column is a variable.

xtrain2 <- data.frame(t(xtrain))
xtest2 <- data.frame(t(xtest))

We check the dimensions of the transposed data.

dim(xtrain2)

## [1]   144 16063

dim(xtest2)

## [1]    54 16063
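
To make the effect of the transposition concrete, here is a small sketch on a toy data frame. The object names `raw` and `flipped` are illustrative and do not appear in the analysis.

```r
# Toy stand-in for the raw layout: 3 variables (rows) by 2 samples (columns).
raw <- data.frame(X1 = c(1.0, 2.0, 3.0),
                  X2 = c(4.0, 5.0, 6.0))

# t() returns a matrix; data.frame() converts it back to a data frame
# with one row per sample and one column per variable.
flipped <- data.frame(t(raw))

dim(raw)      # 3 2
dim(flipped)  # 2 3
```

This is the same pattern applied to xtrain and xtest, which is why the 16063 rows of the raw files become 16063 columns.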

We convert ytrain and ytest, which were read as one-row data frames, into factors.

ytrain2 <- as.factor(unlist(ytrain))

ytest2 <- as.factor(unlist(ytest))

We inspect the class counts in both ytrain2 and ytest2.

table(ytrain2)

## ytrain2
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
##  8  8  8  8 16  8  8  8 24  8  8  8  8 16

table(ytest2)

## ytest2
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
##  4  6  4  4  6  3  2  2  6  3  3  4  3  4
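
The counts above show that the classes are not balanced. As a rough sketch, the counts can be copied by hand from the two tables and turned into proportions; the vector names below are illustrative.

```r
# Class counts copied from the two tables above (classes 1 to 14).
train_counts <- c(8, 8, 8, 8, 16, 8, 8, 8, 24, 8, 8, 8, 8, 16)
test_counts  <- c(4, 6, 4, 4, 6, 3, 2, 2, 6, 3, 3, 4, 3, 4)
names(train_counts) <- 1:14
names(test_counts)  <- 1:14

sum(train_counts)  # 144
sum(test_counts)   # 54

# Share of each class in the training set, rounded for display.
round(train_counts / sum(train_counts), 3)
```

With the factors from the analysis, `prop.table(table(ytrain2))` would give the same proportions directly.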

Support Vector Machine model

First, we build a data frame that combines the predictors and the response.

dat <- data.frame(x = xtrain2, y = ytrain2)

In this dataset, the number of features is very large relative to the number of observations (16063 features for 144 training observations). This suggests that we should use a linear kernel, because the additional flexibility gained from a polynomial or radial kernel is unnecessary. We fit the SVM and display its summary.
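
The argument about many features and few observations can be explored on simulated data. The sketch below is an assumption-laden illustration, not part of the exercise: the predictors are pure noise, the labels are assigned arbitrarily, and the sizes `n` and `p` are chosen only to keep the example fast.

```r
library(e1071)

set.seed(1)
n <- 20    # observations (illustrative)
p <- 200   # features, far more than observations (illustrative)

# Pure-noise predictors and arbitrary labels: there is no real signal to learn.
x <- matrix(rnorm(n * p), nrow = n)
y <- factor(rep(c("a", "b"), each = n / 2))
toy <- data.frame(x = x, y = y)

# Same kind of fit as in the analysis: linear kernel, cost = 10.
fit <- svm(y ~ ., data = toy, kernel = "linear", cost = 10, scale = TRUE)

# Training error rate on the simulated data.
mean(fit$fitted != toy$y)
```

Because the labels carry no information here, a low training error in this setting would reflect only how easily points can be separated when p is much larger than n, not any predictive ability.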

out <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = TRUE)
summary(out)

## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  123
## 
##  ( 8 8 8 8 16 8 8 7 11 8 8 8 8 9 )
## 
## 
## Number of Classes:  14 
## 
## Levels: 
##  1 2 3 4 5 6 7 8 9 10 11 12 13 14

We review the names of the components of the fitted object.

names(out)

##  [1] "call"            "type"            "kernel"          "cost"           
##  [5] "degree"          "gamma"           "coef0"           "nu"             
##  [9] "epsilon"         "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"          "tot.nSV"        
## [17] "nSV"             "labels"          "SV"              "index"          
## [21] "rho"             "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"       "fitted"         
## [29] "decision.values" "terms"

We build a table comparing the fitted values with the response y.

table(out$fitted, dat$y)

##     
##       1  2  3  4  5  6  7  8  9 10 11 12 13 14
##   1   8  0  0  0  0  0  0  0  0  0  0  0  0  0
##   2   0  8  0  0  0  0  0  0  0  0  0  0  0  0
##   3   0  0  8  0  0  0  0  0  0  0  0  0  0  0
##   4   0  0  0  8  0  0  0  0  0  0  0  0  0  0
##   5   0  0  0  0 16  0  0  0  0  0  0  0  0  0
##   6   0  0  0  0  0  8  0  0  0  0  0  0  0  0
##   7   0  0  0  0  0  0  8  0  0  0  0  0  0  0
##   8   0  0  0  0  0  0  0  8  0  0  0  0  0  0
##   9   0  0  0  0  0  0  0  0 24  0  0  0  0  0
##   10  0  0  0  0  0  0  0  0  0  8  0  0  0  0
##   11  0  0  0  0  0  0  0  0  0  0  8  0  0  0
##   12  0  0  0  0  0  0  0  0  0  0  0  8  0  0
##   13  0  0  0  0  0  0  0  0  0  0  0  0  8  0
##   14  0  0  0  0  0  0  0  0  0  0  0  0  0 16

We see that there are no training errors. As the manual notes, this is not surprising: the large number of variables relative to the number of observations means that it is easy to find hyperplanes that fully separate the classes. What interests us most is not the performance of the support vector classifier on the training observations, but its performance on the test observations.

Performance on the test observations

We build the test data frame and tabulate the predictions against the true classes.

dat.te <- data.frame(x = xtest2, y = ytest2)
pred.te <- predict(out, newdata = dat.te)
table(pred.te, dat.te$y)

##        
## pred.te 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##      1  0 0 0 0 0 0 0 1 0  0  0  0  0  0
##      2  0 2 1 0 0 0 0 0 1  0  0  0  0  0
##      3  0 0 2 0 1 0 0 0 0  0  0  0  0  0
##      4  0 0 1 4 0 0 0 0 0  0  0  0  1  0
##      5  0 0 0 0 4 0 0 0 0  0  0  0  0  0
##      6  2 1 0 0 0 3 1 1 0  1  0  0  0  0
##      7  0 0 0 0 0 0 1 0 0  0  0  0  0  0
##      8  1 1 0 0 0 0 0 0 0  2  0  0  0  0
##      9  0 0 0 0 0 0 0 0 5  0  0  0  0  0
##      10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##      11 0 0 0 0 0 0 0 0 0  0  3  1  0  0
##      12 1 0 0 0 1 0 0 0 0  0  0  2  0  0
##      13 0 1 0 0 0 0 0 0 0  0  0  1  2  0
##      14 0 1 0 0 0 0 0 0 0  0  0  0  0  4

We see that using cost = 10 yields 22 errors on the 54 test observations: the diagonal of the table above sums to 32 correct predictions.
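
The error count can be checked directly from the table above, whose diagonal holds the correctly classified test observations. In this short sketch the diagonal is copied by hand from the printed table, and the variable names are illustrative.

```r
# Diagonal of the test confusion table (classes 1 to 14), copied from the output above.
correct <- c(0, 2, 2, 4, 4, 3, 1, 0, 5, 0, 3, 2, 2, 4)
n_test  <- 54                      # number of test observations

errors     <- n_test - sum(correct)  # misclassified test observations
error_rate <- errors / n_test

errors                # 22
round(error_rate, 3)  # 0.407
```

With the objects from the analysis, the same count would be `sum(pred.te != dat.te$y)`.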

Topic 8: Support Vector Machines

Diego Calderón

August 2023
