Following the Week 7 assignment and using the Titanic dataset, we search for the highest accuracy over a grid of gamma and cost values.
First we clean the data and store the result in the data frame titanic_clean:
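library(dplyr)  # provides %>% and select()

titanic <- read.csv("titanic.csv")

# Drop identifier-like columns that are not used as predictors
titanic_clean <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)

# Treat the categorical columns as factors
titanic_clean$Survived <- factor(titanic_clean$Survived)
titanic_clean$Pclass <- factor(titanic_clean$Pclass)
titanic_clean$Sex <- factor(titanic_clean$Sex)
titanic_clean$Embarked <- factor(titanic_clean$Embarked)

# Impute missing ages with the median age
titanic_clean$Age[is.na(titanic_clean$Age)] <- median(titanic_clean$Age, na.rm = TRUE)
# Intended to impute missing ports with "S". The two missing values are read as
# empty strings rather than NA, so they stay as a blank level in the summary.
titanic_clean$Embarked[is.na(titanic_clean$Embarked)] <- "S"

summary(titanic_clean)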
 Survived Pclass      Sex           Age            SibSp           Parch             Fare         Embarked
 0:549    1:216   female:314   Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Min.   :  0.00    :  2
 1:342    2:184   male  :577   1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   C:168
          3:491                Median :28.00   Median :0.000   Median :0.0000   Median : 14.45   Q: 77
                               Mean   :29.36   Mean   :0.523   Mean   :0.3816   Mean   : 32.20   S:644
                               3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00
                               Max.   :80.00   Max.   :8.000   Max.   :6.0000   Max.   :512.33
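The blank Embarked level in the summary suggests that the missing ports arrive as empty strings, which is.na() does not detect. A minimal sketch of one way to handle this, using a toy vector (embarked_raw is a hypothetical stand-in, not the real column) and imputing with the most frequent level rather than a hard-coded "S":

```r
# Toy stand-in for the Embarked column; "" marks a missing port (assumption)
embarked_raw <- c("S", "C", "", "Q", "S", "")

# Convert empty strings to NA before building the factor
embarked_raw[embarked_raw == ""] <- NA
embarked <- factor(embarked_raw)

# Impute with the most frequent observed level
most_common <- names(which.max(table(embarked)))
embarked[is.na(embarked)] <- most_common

table(embarked, useNA = "ifany")
```

An alternative that should have the same effect at load time is passing na.strings = c("", "NA") to read.csv(), so that empty fields become NA in every column.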
We split the data into two sets, “train” for fitting the model and “test” for evaluating it; train contains 70% of the rows of titanic_clean (623 of 891, which leaves 268 for test):
# Randomly pick 70% of the row indices for training
train_index <- sample(1:nrow(titanic_clean), nrow(titanic_clean) * 0.7)
train <- titanic_clean[train_index, ]
# The remaining rows form the test set
test <- titanic_clean[-train_index, ]
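Because sample() draws at random, the split and every accuracy derived from it change from run to run. A minimal sketch of a reproducible split follows; the seed value 123 is arbitrary and was not used for the results in this report, and n is set to the 891 rows reported by the summary:

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

n <- 891                    # rows in titanic_clean, per the summary
n_train <- floor(0.7 * n)   # explicit rounding instead of relying on sample()

train_index <- sample(seq_len(n), n_train)
test_index <- setdiff(seq_len(n), train_index)

length(train_index)
length(test_index)
```

With the seed fixed, rerunning the script yields the same train and test rows, so accuracy differences can be attributed to gamma and cost rather than to the split.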
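We loop over a grid to find the combination of gamma and cost that gives the highest accuracy on the test set:
library(e1071)  # provides svm()

# gamma from 1 to 10 (rows of acc_matrix)
# cost from 1 to 10 (columns of acc_matrix)
acc_matrix <- matrix(nrow = 10, ncol = 10)
for (j in 1:10) {
  for (i in 1:10) {
    svm_titanic <- svm(Survived ~ ., data = train, gamma = i, cost = j)
    pred <- predict(svm_titanic, test)
    # Confusion matrix: predicted class vs. actual class
    cm <- table(pred, original = test$Survived)
    # Accuracy = correctly classified passengers / all test passengers
    acc_matrix[i, j] <- (cm[1, 1] + cm[2, 2]) / sum(cm)
  }
}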
acc_matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.7798507 0.7835821 0.7835821 0.7798507 0.7798507 0.7873134 0.7910448 0.7910448 0.7835821 0.7835821
[2,] 0.7574627 0.7649254 0.7649254 0.7611940 0.7574627 0.7574627 0.7574627 0.7574627 0.7611940 0.7574627
[3,] 0.7462687 0.7537313 0.7574627 0.7574627 0.7611940 0.7611940 0.7574627 0.7611940 0.7611940 0.7611940
[4,] 0.7313433 0.7388060 0.7388060 0.7388060 0.7350746 0.7350746 0.7350746 0.7313433 0.7313433 0.7425373
[5,] 0.7350746 0.7350746 0.7350746 0.7313433 0.7276119 0.7276119 0.7313433 0.7313433 0.7313433 0.7313433
[6,] 0.7276119 0.7350746 0.7276119 0.7276119 0.7276119 0.7313433 0.7313433 0.7313433 0.7313433 0.7313433
[7,] 0.7126866 0.7313433 0.7238806 0.7238806 0.7276119 0.7276119 0.7238806 0.7313433 0.7313433 0.7313433
[8,] 0.7126866 0.7164179 0.7238806 0.7238806 0.7201493 0.7201493 0.7313433 0.7313433 0.7313433 0.7313433
[9,] 0.7164179 0.7126866 0.7126866 0.7126866 0.7201493 0.7238806 0.7238806 0.7238806 0.7201493 0.7201493
[10,] 0.7126866 0.7164179 0.7164179 0.7126866 0.7201493 0.7238806 0.7201493 0.7201493 0.7201493 0.7201493
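The printed matrix does not say which axis is gamma and which is cost. A small sketch of how dimnames can make that explicit, using a 2 x 3 subset of the accuracies above (gamma 1 and 2, cost 6 to 8); the names acc_subset and acc_long are introduced here only for illustration:

```r
# Subset of the accuracies printed above: gamma = 1, 2 (rows), cost = 6, 7, 8 (columns)
acc_subset <- matrix(
  c(0.7873134, 0.7910448, 0.7910448,
    0.7574627, 0.7574627, 0.7574627),
  nrow = 2, byrow = TRUE,
  dimnames = list(gamma = c("1", "2"), cost = c("6", "7", "8"))
)

acc_subset

# Long format: one row per (gamma, cost) pair
acc_long <- as.data.frame(as.table(acc_subset), responseName = "accuracy")
acc_long
```

The same dimnames argument could be passed when acc_matrix is created, which would label the full 10 x 10 grid in the same way.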
Finally, we locate the positions of the best results in terms of gamma and cost:
which(acc_matrix == max(acc_matrix), arr.ind = TRUE)
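     row col
[1,]   1   7
[2,]   1   8
Since the rows of acc_matrix index gamma and the columns index cost, the best accuracy on this split (0.7910448) is reached with gamma = 1 and cost = 7 or 8.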
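Choosing gamma and cost by their accuracy on the test set tends to give an optimistic estimate, because the same rows are used both to select and to evaluate the model. A common alternative is cross-validation on the training data. The sketch below uses tune.svm() from e1071 on two iris species as a stand-in binary problem; the data, the grid values, the seed, and the 5 folds are all assumptions made for illustration, not the settings of this report:

```r
library(e1071)

set.seed(42)  # arbitrary seed so that the folds are reproducible

# Stand-in data: two iris species, so the task is binary like Survived
toy <- droplevels(subset(iris, Species != "setosa"))

# 5-fold cross-validated grid search over gamma and cost
tuned <- tune.svm(
  Species ~ ., data = toy,
  gamma = c(0.01, 0.1, 1),
  cost = c(1, 10),
  tunecontrol = tune.control(cross = 5)
)

tuned$best.parameters   # selected gamma and cost
tuned$best.performance  # cross-validated classification error
```

For the Titanic data, the formula would become Survived ~ . with data = train, and the test set would then be used only once, to evaluate the selected model.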