Spaceship Titanic

Introduction:

This study addresses a classification problem on the Spaceship Titanic data set by analyzing passenger records. The core task is to predict, from the attributes in the data set, whether a passenger was transported (the ‘Transported’ column). To that end, the data were cleaned and feature engineering was applied to make them suitable for modeling.

Data Preprocessing:

During preprocessing, missing values were filled, categorical variables were converted to factors, and new features were created. Key attributes such as “Cabin” and “Age” received dedicated handling: “Cabin” was split into deck, number, and side, and “Age” was binned into age groups.

Algorithms Used:

Random Forest: A random forest was fitted to the training data. This decision-tree-based method is an ensemble learning model built by combining many decision trees.

Naive Bayes: Naive Bayes is a probabilistic classification algorithm that applies Bayes' theorem under the assumption that the features are conditionally independent given the class. In this study, a Naive Bayes model trained with the e1071 package was used to predict, on the test data, whether each passenger was transported.

k-Nearest Neighbors (KNN): The k-NN algorithm assigns a class to an observation based on the classes of its nearest neighbors in feature space. In this study, the trained k-NN model was used to make predictions on the test data.
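The neighbor-voting idea can be illustrated with a minimal base-R sketch. The toy data and the helper name `knn_predict_one` are assumptions made for illustration only; the models in this report are trained with caret, not with this function.

```r
# Toy training data: two numeric features and a class label (illustrative only)
train_x <- matrix(c(1.0, 1.2,
                    0.9, 1.0,
                    3.0, 3.1,
                    3.2, 2.9,
                    1.1, 0.8), ncol = 2, byrow = TRUE)
train_y <- c("FALSE", "FALSE", "TRUE", "TRUE", "FALSE")

# Hypothetical helper: classify one point by majority vote among its k nearest neighbors
knn_predict_one <- function(point, x, y, k = 3) {
  # Euclidean distance from the query point to every training row
  distances <- sqrt(rowSums(sweep(x, 2, point)^2))
  # Labels of the k closest rows
  nearest <- y[order(distances)[seq_len(k)]]
  # Most frequent label among those neighbors
  names(which.max(table(nearest)))
}

print(knn_predict_one(c(1.0, 1.0), train_x, train_y, k = 3))
print(knn_predict_one(c(3.1, 3.0), train_x, train_y, k = 3))
```

With k = 3, the first query point is surrounded by three "FALSE" rows, while the second has two "TRUE" rows among its three neighbors, so the vote goes to "TRUE".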

With these algorithms, the aim is to solve the classification problem on the Spaceship Titanic data set and to predict whether each passenger was transported.

Setting the working directory

setwd("C:/Users/acer/Desktop/AWAB 2.5")

Loading the required libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("readr", quietly = TRUE)
library('caret', quietly = TRUE)

## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library('randomForest', quietly = TRUE)

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library('dplyr', quietly = TRUE)
library('tidyr', quietly = TRUE)
library('stringr', quietly = TRUE)
library("e1071", quietly = TRUE)

Reading and displaying the train and test files

train <- read_csv("train.csv")

## Rows: 8693 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
## dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
## lgl (3): CryoSleep, VIP, Transported
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(train)

## # A tibble: 6 × 14
##   PassengerId HomePlanet CryoSleep Cabin Destination     Age VIP   RoomService
##   <chr>       <chr>      <lgl>     <chr> <chr>         <dbl> <lgl>       <dbl>
## 1 0001_01     Europa     FALSE     B/0/P TRAPPIST-1e      39 FALSE           0
## 2 0002_01     Earth      FALSE     F/0/S TRAPPIST-1e      24 FALSE         109
## 3 0003_01     Europa     FALSE     A/0/S TRAPPIST-1e      58 TRUE           43
## 4 0003_02     Europa     FALSE     A/0/S TRAPPIST-1e      33 FALSE           0
## 5 0004_01     Earth      FALSE     F/1/S TRAPPIST-1e      16 FALSE         303
## 6 0005_01     Earth      FALSE     F/0/P PSO J318.5-22    44 FALSE           0
## # ℹ 6 more variables: FoodCourt <dbl>, ShoppingMall <dbl>, Spa <dbl>,
## #   VRDeck <dbl>, Name <chr>, Transported <lgl>

test <- read_csv("test.csv")

## Rows: 4277 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
## dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
## lgl (2): CryoSleep, VIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(test)

## # A tibble: 6 × 13
##   PassengerId HomePlanet CryoSleep Cabin Destination   Age VIP   RoomService
##   <chr>       <chr>      <lgl>     <chr> <chr>       <dbl> <lgl>       <dbl>
## 1 0013_01     Earth      TRUE      G/3/S TRAPPIST-1e    27 FALSE           0
## 2 0018_01     Earth      FALSE     F/4/S TRAPPIST-1e    19 FALSE           0
## 3 0019_01     Europa     TRUE      C/0/S 55 Cancri e    31 FALSE           0
## 4 0021_01     Europa     FALSE     C/1/S TRAPPIST-1e    38 FALSE           0
## 5 0023_01     Earth      FALSE     F/5/S TRAPPIST-1e    20 FALSE          10
## 6 0027_01     Earth      FALSE     F/7/P TRAPPIST-1e    31 FALSE           0
## # ℹ 5 more variables: FoodCourt <dbl>, ShoppingMall <dbl>, Spa <dbl>,
## #   VRDeck <dbl>, Name <chr>

# Keeping the original test 'PassengerId' values for the submission files
passid <- test$PassengerId

# Separating 'PassengerId' column in 'train' dataframe
train <- separate(train, PassengerId, into = c("groups", "num"), sep = "_")

# Separating 'Cabin' column in 'train' dataframe
train <- separate(train, Cabin, into = c("c_deck", "C_num", "C_side"), sep = "/")

# Displaying the modified 'train' table after splitting the columns

head(train)

## # A tibble: 6 × 17
##   groups num   HomePlanet CryoSleep c_deck C_num C_side Destination    Age VIP  
##   <chr>  <chr> <chr>      <lgl>     <chr>  <chr> <chr>  <chr>        <dbl> <lgl>
## 1 0001   01    Europa     FALSE     B      0     P      TRAPPIST-1e     39 FALSE
## 2 0002   01    Earth      FALSE     F      0     S      TRAPPIST-1e     24 FALSE
## 3 0003   01    Europa     FALSE     A      0     S      TRAPPIST-1e     58 TRUE 
## 4 0003   02    Europa     FALSE     A      0     S      TRAPPIST-1e     33 FALSE
## 5 0004   01    Earth      FALSE     F      1     S      TRAPPIST-1e     16 FALSE
## 6 0005   01    Earth      FALSE     F      0     P      PSO J318.5-…    44 FALSE
## # ℹ 7 more variables: RoomService <dbl>, FoodCourt <dbl>, ShoppingMall <dbl>,
## #   Spa <dbl>, VRDeck <dbl>, Name <chr>, Transported <lgl>

Filling Missing Values:

Missing values in the ‘HomePlanet’, ‘Destination’, ‘RoomService’, ‘FoodCourt’, ‘ShoppingMall’, ‘Spa’, ‘VRDeck’ and ‘CryoSleep’ columns were filled with fixed default values using the coalesce function: ‘Earth’ for ‘HomePlanet’, ‘TRAPPIST-1e’ for ‘Destination’, 0 for the five spending columns, and FALSE for ‘CryoSleep’.

Filling Missing Values in ‘VIP’ and the Cabin Columns:

Missing values in ‘VIP’, ‘c_deck’ and ‘C_side’ were also filled with coalesce: FALSE for ‘VIP’ and the placeholder value ‘N’ for the two cabin columns. This prepares them for use as categorical variables.

Computing Total Spending:

The ‘bills’ column is the sum of the ‘RoomService’, ‘FoodCourt’, ‘ShoppingMall’, ‘Spa’ and ‘VRDeck’ columns. It combines the spending on the individual services into a single new feature.

Categorizing by Age Group:

A new column, ‘age_group’, was derived from ‘Age’, assigning passengers to ten-year bins coded 0 to 7. Passengers with a missing age receive the code 99.

Converting Categorical Columns to Factors:

The ‘VIP’, ‘Destination’, ‘HomePlanet’, ‘Transported’, ‘CryoSleep’, ‘c_deck’ and ‘C_num’ columns were converted to factors with the as.factor function, so that the models treat them as categorical variables.
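As a small illustration of how coalesce fills gaps, the sketch below runs it on two made-up vectors (the values are assumptions, not rows from the data set). The default passed as the second argument should have a type compatible with the column, for example a string for a character column and 0 for a numeric one.

```r
library(dplyr)

# Toy vectors with missing entries (illustrative values, not taken from the data set)
home_planet  <- c("Europa", NA, "Mars")
room_service <- c(109, NA, 0)

# coalesce() keeps existing values and substitutes the default only where the entry is NA
home_filled <- coalesce(home_planet, "Earth")
room_filled <- coalesce(room_service, 0)

print(home_filled)
print(room_filled)
```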

# Filling missing values in certain columns with default values
train$HomePlanet <- coalesce(train$HomePlanet, "Earth")
train$Destination <- coalesce(train$Destination, "TRAPPIST-1e")
train$RoomService <- coalesce(train$RoomService, 0)
train$FoodCourt <- coalesce(train$FoodCourt, 0)
train$ShoppingMall <- coalesce(train$ShoppingMall, 0)
train$Spa <- coalesce(train$Spa, 0)
train$VRDeck <- coalesce(train$VRDeck, 0)
train$CryoSleep <- coalesce(train$CryoSleep, FALSE)

# Displaying the modified columns after filling missing values

cat("HomePlanet: ", head(train$HomePlanet), "\n")

## HomePlanet:  Europa Earth Europa Europa Earth Earth

cat("Destination: ", head(train$Destination), "\n")

## Destination:  TRAPPIST-1e TRAPPIST-1e TRAPPIST-1e TRAPPIST-1e TRAPPIST-1e PSO J318.5-22

cat("RoomService: ", head(train$RoomService), "\n")

## RoomService:  0 109 43 0 303 0

cat("FoodCourt: ", head(train$FoodCourt), "\n")

## FoodCourt:  0 9 3576 1283 70 483

cat("ShoppingMall: ", head(train$ShoppingMall), "\n")

## ShoppingMall:  0 25 0 371 151 0

cat("Spa: ", head(train$Spa), "\n")

## Spa:  0 549 6715 3329 565 291

cat("VRDeck: ", head(train$VRDeck), "\n")

## VRDeck:  0 44 49 193 2 0

cat("CryoSleep: ", head(train$CryoSleep), "\n")

## CryoSleep:  FALSE FALSE FALSE FALSE FALSE FALSE

# Filling missing values in 'VIP' and the cabin columns; "N" marks an unknown deck or side
train$VIP <- coalesce(train$VIP, FALSE)
train$c_deck <- coalesce(train$c_deck, "N")
train$C_side <- coalesce(train$C_side, "N")

# Displaying the modified columns after filling missing values

cat("VIP: ", head(train$VIP), "\n")

## VIP:  FALSE FALSE TRUE FALSE FALSE FALSE

cat("c_deck: ", head(train$c_deck), "\n")

## c_deck:  B F A A F F

cat("C_side: ", head(train$C_side), "\n")

## C_side:  P S S S S P

# Calculating the sum of certain columns and storing the result
train$bills <- train$RoomService + train$FoodCourt + train$ShoppingMall + train$Spa + train$VRDeck

# Categorizing passengers into ten-year age groups based on the 'Age' column;
# rows with a missing 'Age' fall through to the final branch and receive the code 99
train$age_group <- case_when(
  train$Age <= 10 ~ 0,
  train$Age <= 20 ~ 1,
  train$Age <= 30 ~ 2,
  train$Age <= 40 ~ 3,
  train$Age <= 50 ~ 4,
  train$Age <= 60 ~ 5,
  train$Age <= 70 ~ 6,
  train$Age <= 80 ~ 7,
  TRUE ~ 99
)
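As an alternative sketch, the same bins can be produced with base R's `cut()`. The sample ages below are made up for illustration, and this is not the code used in the report.

```r
# Illustrative ages, including a missing value
ages <- c(0, 10, 11, 39, 80, NA)

# cut() with right-closed intervals reproduces the "<= 10", "<= 20", ... rules;
# labels = FALSE returns the bin index 1..8, so subtracting 1 gives the codes 0..7
age_group <- cut(ages, breaks = c(-Inf, seq(10, 80, by = 10)), labels = FALSE) - 1

# Missing ages (and ages above 80) come back as NA; map them to 99 as in case_when()
age_group[is.na(age_group)] <- 99

print(age_group)
```

Because the intervals are closed on the right, an age of exactly 10 lands in group 0 and an age of 11 in group 1, matching the `<=` comparisons in the `case_when()` version.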

# Converting certain columns to factors for categorical encoding
train$VIP <- as.factor(train$VIP)
train$Destination <- as.factor(train$Destination)
train$HomePlanet <- as.factor(train$HomePlanet)
train$Transported <- as.factor(train$Transported)
train$CryoSleep <- as.factor(train$CryoSleep)
train$c_deck <- as.factor(train$c_deck)
train$C_num <- as.factor(train$C_num)

# Scatter plot of Age against Bills, colored by Transported status
ggplot(train, aes(x = Age, y = bills, color = Transported)) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs Bills (Colored by Transported Status)", x = "Age", y = "Bills")

## Warning: Removed 179 rows containing missing values (`geom_point()`).

# Creating the modeling dataframe 'df' by dropping 'Name', 'groups', 'C_num', and 'Age'
df <- train %>% select(-c(Name, groups, C_num, Age))
# Displaying the summary of the 'df' dataframe
summary(df)

##      num             HomePlanet   CryoSleep        c_deck    
##  Length:8693        Earth :4803   FALSE:5656   F      :2794  
##  Class :character   Europa:2131   TRUE :3037   G      :2559  
##  Mode  :character   Mars  :1759                E      : 876  
##                                                B      : 779  
##                                                C      : 747  
##                                                D      : 478  
##                                                (Other): 460  
##     C_side                 Destination      VIP        RoomService   
##  Length:8693        55 Cancri e  :1800   FALSE:8494   Min.   :    0  
##  Class :character   PSO J318.5-22: 796   TRUE : 199   1st Qu.:    0  
##  Mode  :character   TRAPPIST-1e  :6097                Median :    0  
##                                                       Mean   :  220  
##                                                       3rd Qu.:   41  
##                                                       Max.   :14327  
##                                                                      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  448.4   Mean   :  169.6   Mean   :  304.6   Mean   :  298.3  
##  3rd Qu.:   61.0   3rd Qu.:   22.0   3rd Qu.:   53.0   3rd Qu.:   40.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##                                                                         
##  Transported      bills         age_group    
##  FALSE:4315   Min.   :    0   Min.   : 0.00  
##  TRUE :4378   1st Qu.:    0   1st Qu.: 1.00  
##               Median :  716   Median : 2.00  
##               Mean   : 1441   Mean   : 4.34  
##               3rd Qu.: 1441   3rd Qu.: 3.00  
##               Max.   :35987   Max.   :99.00  
##

# Training a random forest classifier for 'Transported' on all remaining columns of 'df'
model <- randomForest(Transported ~ ., data = df, importance = TRUE, na.action = na.omit)
# Displaying information about the trained random forest model
print(model)

## 
## Call:
##  randomForest(formula = Transported ~ ., data = df, importance = TRUE,      na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 20.13%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE  3299 1016   0.2354577
## TRUE    734 3644   0.1676565

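# Predicting on the training data itself; the resulting accuracy is optimistic
# compared with the out-of-bag estimate printed above
prediction <- predict(model, df)

# Creating a confusion matrix to evaluate model performance.
# Note: confusionMatrix() takes the predictions first and the true labels second.
# Here the true labels are passed first, so in the output below the "Prediction" rows
# are the true labels and the "Reference" columns are the model's predictions.

conMat <- confusionMatrix(df$Transported, prediction)

# Displaying the confusion matrix

print(conMat)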

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  3595  720
##      TRUE     23 4355
##                                           
##                Accuracy : 0.9145          
##                  95% CI : (0.9085, 0.9203)
##     No Information Rate : 0.5838          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8288          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9936          
##             Specificity : 0.8581          
##          Pos Pred Value : 0.8331          
##          Neg Pred Value : 0.9947          
##              Prevalence : 0.4162          
##          Detection Rate : 0.4136          
##    Detection Prevalence : 0.4964          
##       Balanced Accuracy : 0.9259          
##                                           
##        'Positive' Class : FALSE           
##
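The two confusion matrices printed so far measure different things: the first is the out-of-bag (OOB) estimate from the forest itself, the second comes from predicting on the same rows the forest was trained on. The arithmetic behind the two headline figures can be checked with a short sketch that re-enters the printed counts by hand.

```r
# Out-of-bag confusion matrix as printed by print(model): rows = actual, columns = predicted
oob <- matrix(c(3299, 1016,
                 734, 3644), nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("FALSE", "TRUE"), predicted = c("FALSE", "TRUE")))

# Confusion matrix from predicting on the training data, as printed by print(conMat)
resub <- matrix(c(3595,  720,
                    23, 4355), nrow = 2, byrow = TRUE)

# Error rate = 1 - (correctly classified / total); accuracy = correctly classified / total
oob_error      <- 1 - sum(diag(oob)) / sum(oob)
resub_accuracy <- sum(diag(resub)) / sum(resub)

print(round(100 * oob_error, 2))   # OOB error rate in percent: 20.13
print(round(resub_accuracy, 4))    # accuracy on the training data: 0.9145
```

The OOB figure corresponds to an accuracy of roughly 79.9%, which is the more realistic indication of how the forest may behave on unseen passengers; the 91.45% figure is inflated because the trees have already seen those rows.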

# Extracting confusion matrix data
conMatData <- as.table(conMat)

# Creating a data frame for plotting; as.data.frame() on a table returns one row per cell
# with the columns 'Prediction', 'Reference', and 'Freq'
confusion_df <- as.data.frame(conMatData)

# Creating a heatmap
ggplot(confusion_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix Heatmap",
       x = "Reference",
       y = "Prediction") +
  theme_minimal()

Column Splitting: The ‘PassengerId’ and ‘Cabin’ columns of the test data were split with the separate function, mirroring the training data. ‘PassengerId’ was split into two columns, ‘groups’ and ‘num’, and ‘Cabin’ into three columns, ‘c_deck’, ‘C_num’, and ‘C_side’.

Filling Missing Values: Missing values in selected columns were filled with default values using the coalesce function. For example, missing ‘HomePlanet’ values were filled with ‘Earth’ and missing ‘Destination’ values with ‘TRAPPIST-1e’. The remaining columns were completed with the same defaults as in the training data.

Conversion to Factors: The categorical columns ‘VIP’, ‘Destination’, ‘HomePlanet’, ‘CryoSleep’, ‘c_deck’, and ‘C_num’ were converted to factors with the as.factor function, so that the models treat them as categorical variables.

Total Bill Calculation: The ‘bills’ column was created as the sum of the ‘RoomService’, ‘FoodCourt’, ‘ShoppingMall’, ‘Spa’ and ‘VRDeck’ columns.

Age Grouping: Using the ‘Age’ column, passengers were assigned to the same ten-year age groups as in the training data.

Data Frame for Prediction: Finally, a new data frame, ‘df_test’, was created by dropping the ‘Name’ column. It contains the features the models use for prediction.
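Since the training and test data go through the same steps, one option is to collect them in a single function so both data frames are guaranteed identical treatment. The sketch below is only an illustration: the helper name `prepare_features` and the two-row toy table are assumptions, and the factor conversion is left out because factor levels usually need to be aligned between the two data sets.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: applies the preprocessing steps described above to one data frame
prepare_features <- function(data) {
  data %>%
    separate(PassengerId, into = c("groups", "num"), sep = "_") %>%
    separate(Cabin, into = c("c_deck", "C_num", "C_side"), sep = "/") %>%
    mutate(
      HomePlanet  = coalesce(HomePlanet, "Earth"),
      Destination = coalesce(Destination, "TRAPPIST-1e"),
      across(c(RoomService, FoodCourt, ShoppingMall, Spa, VRDeck), ~ coalesce(.x, 0)),
      CryoSleep   = coalesce(CryoSleep, FALSE),
      VIP         = coalesce(VIP, FALSE),
      c_deck      = coalesce(c_deck, "N"),
      C_side      = coalesce(C_side, "N"),
      bills       = RoomService + FoodCourt + ShoppingMall + Spa + VRDeck,
      age_group   = case_when(
        Age <= 10 ~ 0, Age <= 20 ~ 1, Age <= 30 ~ 2, Age <= 40 ~ 3,
        Age <= 50 ~ 4, Age <= 60 ~ 5, Age <= 70 ~ 6, Age <= 80 ~ 7,
        TRUE ~ 99
      )
    )
}

# Two made-up rows in the shape of the raw data
toy <- tibble(
  PassengerId = c("0001_01", "0002_01"),
  HomePlanet = c("Europa", NA), CryoSleep = c(FALSE, NA),
  Cabin = c("B/0/P", NA), Destination = c("TRAPPIST-1e", NA),
  Age = c(39, NA), VIP = c(NA, FALSE),
  RoomService = c(0, NA), FoodCourt = c(10, 5), ShoppingMall = c(NA, 0),
  Spa = c(1, 2), VRDeck = c(0, 0)
)

prepared <- prepare_features(toy)
print(prepared)
```

With such a helper, `prepare_features(train)` and `prepare_features(test)` would replace the two duplicated blocks of assignments.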

test <- separate(test, PassengerId, into = c("groups", "num"), sep = "_")
test <- separate(test, Cabin, into = c("c_deck", "C_num", "C_side"), sep = "/")

head(test)

## # A tibble: 6 × 16
##   groups num   HomePlanet CryoSleep c_deck C_num C_side Destination   Age VIP  
##   <chr>  <chr> <chr>      <lgl>     <chr>  <chr> <chr>  <chr>       <dbl> <lgl>
## 1 0013   01    Earth      TRUE      G      3     S      TRAPPIST-1e    27 FALSE
## 2 0018   01    Earth      FALSE     F      4     S      TRAPPIST-1e    19 FALSE
## 3 0019   01    Europa     TRUE      C      0     S      55 Cancri e    31 FALSE
## 4 0021   01    Europa     FALSE     C      1     S      TRAPPIST-1e    38 FALSE
## 5 0023   01    Earth      FALSE     F      5     S      TRAPPIST-1e    20 FALSE
## 6 0027   01    Earth      FALSE     F      7     P      TRAPPIST-1e    31 FALSE
## # ℹ 6 more variables: RoomService <dbl>, FoodCourt <dbl>, ShoppingMall <dbl>,
## #   Spa <dbl>, VRDeck <dbl>, Name <chr>

# Filling missing values in certain columns with default values
test$HomePlanet <- coalesce(test$HomePlanet, "Earth")
test$Destination <- coalesce(test$Destination, "TRAPPIST-1e")
test$RoomService <- coalesce(test$RoomService, 0)
test$FoodCourt <- coalesce(test$FoodCourt, 0)
test$ShoppingMall <- coalesce(test$ShoppingMall, 0)
test$Spa <- coalesce(test$Spa, 0)
test$VRDeck <- coalesce(test$VRDeck, 0)
test$CryoSleep <- coalesce(test$CryoSleep, FALSE)

summary(test[, sapply(test, is.numeric)])

##       Age         RoomService        FoodCourt        ShoppingMall   
##  Min.   : 0.00   Min.   :    0.0   Min.   :    0.0   Min.   :   0.0  
##  1st Qu.:19.00   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:   0.0  
##  Median :26.00   Median :    0.0   Median :    0.0   Median :   0.0  
##  Mean   :28.66   Mean   :  215.1   Mean   :  428.6   Mean   : 173.2  
##  3rd Qu.:37.00   3rd Qu.:   48.0   3rd Qu.:   66.0   3rd Qu.:  27.0  
##  Max.   :79.00   Max.   :11567.0   Max.   :25273.0   Max.   :8292.0  
##  NA's   :91                                                          
##       Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0  
##  Mean   :  295.9   Mean   :  304.9  
##  3rd Qu.:   43.0   3rd Qu.:   31.0  
##  Max.   :19844.0   Max.   :22272.0  
##

# Plot bar chart for 'HomePlanet'
barplot(table(test$HomePlanet), main = "Distribution of HomePlanet", col = "skyblue", xlab = "HomePlanet")

# Pie chart for 'VIP' distribution
pie(table(test$VIP), main = "VIP Distribution", col = c("skyblue", "lightcoral"))

# Histogram for 'Age'
hist(test$Age, main = "Age Distribution", col = "lightgreen", xlab = "Age")

# Filling missing values in 'VIP' and the cabin columns; "N" marks an unknown deck or side
test$VIP <- coalesce(test$VIP, FALSE)
test$c_deck <- coalesce(test$c_deck, "N")
test$C_side <- coalesce(test$C_side, "N")

# Calculating the sum of 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', and 'VRDeck' columns
test$bills <- test$RoomService + test$FoodCourt + test$ShoppingMall + test$Spa + test$VRDeck

# Categorizing passengers into age groups based on the 'Age' column
test$age_group <- case_when(
  test$Age <= 10 ~ 0,
  test$Age <= 20 ~ 1,
  test$Age <= 30 ~ 2,
  test$Age <= 40 ~ 3,
  test$Age <= 50 ~ 4,
  test$Age <= 60 ~ 5,
  test$Age <= 70 ~ 6,
  test$Age <= 80 ~ 7,
  TRUE ~ 99
)

# Converting certain columns to factors for categorical encoding
test$VIP <- as.factor(test$VIP)
test$Destination <- as.factor(test$Destination)
test$HomePlanet <- as.factor(test$HomePlanet)
test$CryoSleep <- as.factor(test$CryoSleep)
test$c_deck <- as.factor(test$c_deck)
test$C_num <- as.factor(test$C_num)

# Creating a new dataframe 'df_test' for prediction by dropping 'Name'; the other extra
# columns ('groups', 'C_num', 'Age') are kept but are not among the models' predictors
df_test <- test %>% select(-c(Name))

summary(df_test)

##     groups              num             HomePlanet   CryoSleep   
##  Length:4277        Length:4277        Earth :2350   FALSE:2733  
##  Class :character   Class :character   Europa:1002   TRUE :1544  
##  Mode  :character   Mode  :character   Mars  : 925               
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##      c_deck         C_num         C_side                 Destination  
##  F      :1445   4      :  21   Length:4277        55 Cancri e  : 841  
##  G      :1222   31     :  18   Class :character   PSO J318.5-22: 388  
##  E      : 447   197    :  16   Mode  :character   TRAPPIST-1e  :3048  
##  B      : 362   294    :  16                                          
##  C      : 355   228    :  14                                          
##  D      : 242   (Other):4092                                          
##  (Other): 204   NA's   : 100                                          
##       Age           VIP        RoomService        FoodCourt      
##  Min.   : 0.00   FALSE:4203   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:19.00   TRUE :  74   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :26.00                Median :    0.0   Median :    0.0  
##  Mean   :28.66                Mean   :  215.1   Mean   :  428.6  
##  3rd Qu.:37.00                3rd Qu.:   48.0   3rd Qu.:   66.0  
##  Max.   :79.00                Max.   :11567.0   Max.   :25273.0  
##  NA's   :91                                                      
##   ShoppingMall         Spa              VRDeck            bills      
##  Min.   :   0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0  
##  1st Qu.:   0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0  
##  Median :   0.0   Median :    0.0   Median :    0.0   Median :  714  
##  Mean   : 173.2   Mean   :  295.9   Mean   :  304.9   Mean   : 1418  
##  3rd Qu.:  27.0   3rd Qu.:   43.0   3rd Qu.:   31.0   3rd Qu.: 1444  
##  Max.   :8292.0   Max.   :19844.0   Max.   :22272.0   Max.   :33666  
##                                                                      
##    age_group     
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 2.000  
##  Mean   : 4.389  
##  3rd Qu.: 3.000  
##  Max.   :99.000  
##

# Making predictions on the test data using the trained Random Forest model
rf_prediction <- predict(model, df_test)

summary(rf_prediction)

## FALSE  TRUE 
##  1965  2312

# Creating a dataframe for the Random Forest submission ("TRUE"/"FALSE" become "True"/"False")
submit_rf <- data.frame(PassengerId = passid, Transported = str_to_title(rf_prediction))

write_csv(submit_rf, "C:/Users/acer/Desktop/AWAB 2.5/submission_rf.csv", col_names = TRUE)

cat("Random Forest results saved to 'submission_rf.csv'.\n")

## Random Forest results saved to 'submission_rf.csv'.

head(submit_rf)

##   PassengerId Transported
## 1     0013_01        True
## 2     0018_01       False
## 3     0019_01        True
## 4     0021_01        True
## 5     0023_01       False
## 6     0027_01       False

Training the k-Nearest Neighbors (KNN) Model: In this section, a k-nearest neighbors model is trained with caret's train function using method = "knn".

# Creating k-Nearest Neighbors model
knn_model <- train(Transported ~ ., data = df, method = "knn")

# Summary of knn_model
summary(knn_model)

##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      30     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list
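Because k-NN relies on distances, features with large ranges (such as the spending columns here) can dominate the neighbor search, and the choice of k also matters. The sketch below shows one way to standardize the features and compare several values of k with cross-validation in caret. It uses the built-in iris data purely as a stand-in; the seed, the candidate k values, and the five folds are arbitrary assumptions, not settings used in this report.

```r
library(caret)

set.seed(42)  # arbitrary seed so the resampling folds are reproducible

# Built-in iris data stand in for the passenger data in this illustration
knn_fit <- train(
  Species ~ ., data = iris,
  method = "knn",
  preProcess = c("center", "scale"),          # put all features on a comparable scale
  tuneGrid = data.frame(k = c(3, 5, 7, 9)),   # candidate neighborhood sizes
  trControl = trainControl(method = "cv", number = 5)
)

# Cross-validated accuracy for each candidate k, and the k that was selected
print(knn_fit$results[, c("k", "Accuracy")])
print(knn_fit$bestTune)
```

Inspecting `results` and `bestTune` (or simply printing the fitted object) tends to be more informative than `summary()`, which only lists the internal components of the final model.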

Predictions on the Test Data with KNN: In this section, the trained k-NN model is used to make predictions on the test data set. The predictions are converted to a data frame and saved to a CSV file.

# Making predictions on the test data using the trained model for k-Nearest Neighbors (KNN)
knn_prediction <- predict(knn_model, df_test)

# Creating a dataframe for KNN submission
knn_submit <- data.frame(PassengerId = passid, Transported = str_to_title(knn_prediction))

# Writing KNN results to a CSV file
write_csv(knn_submit, "C:/Users/acer/Desktop/AWAB 2.5/submission_knn.csv", col_names = TRUE)

# Displaying a message indicating that KNN results are saved
cat("k-Nearest Neighbors (KNN) results saved to 'submission_knn.csv'.\n")

## k-Nearest Neighbors (KNN) results saved to 'submission_knn.csv'.

# Displaying the head of the KNN submission dataframe
head(knn_submit)

##   PassengerId Transported
## 1     0013_01        True
## 2     0018_01       False
## 3     0019_01        True
## 4     0021_01        True
## 5     0023_01       False
## 6     0027_01        True

Training the Naive Bayes Model: In this section, a Naive Bayes model is trained with the naiveBayes function from the e1071 package.

# Creating a Naive Bayes model
nb_model <- naiveBayes(Transported ~ ., data = df)


# Making predictions on the test data using the trained Naive Bayes model
nb_prediction <- predict(nb_model, df_test)

# Creating a dataframe for Naive Bayes submission
nb_submit <- data.frame(PassengerId = passid, Transported = str_to_title(as.character(nb_prediction)))

# Writing Naive Bayes results to a CSV file
write_csv(nb_submit, "C:/Users/acer/Desktop/AWAB 2.5/submission_nb.csv", col_names = TRUE)

# Displaying a message indicating that Naive Bayes results are saved
cat("Naive Bayes results saved to 'submission_nb.csv'.\n")

## Naive Bayes results saved to 'submission_nb.csv'.

# Displaying the head of the Naive Bayes submission dataframe
head(nb_submit)

##   PassengerId Transported
## 1     0013_01        True
## 2     0018_01       False
## 3     0019_01        True
## 4     0021_01        True
## 5     0023_01        True
## 6     0027_01        True

Predictions on the Test Data with Naive Bayes:

In the steps above, the trained Naive Bayes model was used to make predictions on the test data set. The predictions were converted to a data frame and saved to a CSV file.
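Besides hard class labels, a Naive Bayes model from e1071 can also return the posterior probability of each class, which helps when judging how confident a prediction is. The sketch below uses the built-in iris data as a stand-in for the passenger data, so the variable names are assumptions for illustration.

```r
library(e1071)

# Built-in iris data stand in for the passenger data in this illustration
nb_fit <- naiveBayes(Species ~ ., data = iris)

# type = "class" (the default) returns the predicted label;
# type = "raw" returns one posterior probability per class
nb_class <- predict(nb_fit, iris[1:3, ], type = "class")
nb_prob  <- predict(nb_fit, iris[1:3, ], type = "raw")

print(nb_class)
print(round(nb_prob, 3))
```

Each row of the probability matrix sums to 1, and the predicted label is the class with the largest probability in that row.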

# Computing the (unscaled) variable importance of the random forest model
var_imp_plot_rf <- varImp(model, scale = FALSE)
print(var_imp_plot_rf)

##                      FALSE          TRUE
## num           6.368924e-04  6.368924e-04
## HomePlanet    1.539067e-02  1.539067e-02
## CryoSleep     4.399063e-02  4.399063e-02
## c_deck        1.829961e-02  1.829961e-02
## C_side        3.858145e-03  3.858145e-03
## Destination   3.165958e-03  3.165958e-03
## VIP          -6.598761e-05 -6.598761e-05
## RoomService   3.379596e-02  3.379596e-02
## FoodCourt     2.788754e-02  2.788754e-02
## ShoppingMall  1.726626e-02  1.726626e-02
## Spa           3.975920e-02  3.975920e-02
## VRDeck        3.561655e-02  3.561655e-02
## bills         9.635589e-02  9.635589e-02
## age_group     4.900259e-03  4.900259e-03
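The importance table is easier to read when sorted. The sketch below copies the scores printed above into a named vector by hand (one column only, since both columns are identical) and ranks them; it does not call the fitted model.

```r
# Importance scores copied from the varImp() output above
importance_scores <- c(
  num = 6.368924e-04, HomePlanet = 1.539067e-02, CryoSleep = 4.399063e-02,
  c_deck = 1.829961e-02, C_side = 3.858145e-03, Destination = 3.165958e-03,
  VIP = -6.598761e-05, RoomService = 3.379596e-02, FoodCourt = 2.788754e-02,
  ShoppingMall = 1.726626e-02, Spa = 3.975920e-02, VRDeck = 3.561655e-02,
  bills = 9.635589e-02, age_group = 4.900259e-03
)

# Sort from most to least important
ranked <- sort(importance_scores, decreasing = TRUE)
print(round(ranked, 4))
```

In this ranking ‘bills’ has by far the largest score, followed by ‘CryoSleep’ and the individual spending columns, while ‘VIP’ has a slightly negative score.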

# Assuming 'knn_model' is the KNN model created with the train function
knn_summary_df <- as.data.frame(summary(knn_model))
print(knn_summary_df)

##           Var1   Var2       Freq
## 1        learn Length          2
## 2            k Length          1
## 3      theDots Length          0
## 4       xNames Length         30
## 5  problemType Length          1
## 6    tuneValue Length          1
## 7    obsLevels Length          2
## 8        param Length          0
## 9        learn  Class     -none-
## 10           k  Class     -none-
## 11     theDots  Class     -none-
## 12      xNames  Class     -none-
## 13 problemType  Class     -none-
## 14   tuneValue  Class data.frame
## 15   obsLevels  Class     -none-
## 16       param  Class     -none-
## 17       learn   Mode       list
## 18           k   Mode    numeric
## 19     theDots   Mode       list
## 20      xNames   Mode  character
## 21 problemType   Mode  character
## 22   tuneValue   Mode       list
## 23   obsLevels   Mode  character
## 24       param   Mode       list

# Assuming 'nb_model' is the Naive Bayes model created with the naiveBayes function
nb_summary_df <- as.data.frame(summary(nb_model))
print(nb_summary_df)

##         Var1   Var2      Freq
## 1    apriori Length         2
## 2     tables Length        14
## 3     levels Length         2
## 4  isnumeric Length        14
## 5       call Length         4
## 6    apriori  Class     table
## 7     tables  Class    -none-
## 8     levels  Class    -none-
## 9  isnumeric  Class    -none-
## 10      call  Class    -none-
## 11   apriori   Mode   numeric
## 12    tables   Mode      list
## 13    levels   Mode character
## 14 isnumeric   Mode   logical
## 15      call   Mode      call

AWAB ADIL HASSAN ABDALLA

2024-01-20