Spaceship Titanic

Author

Olamgır Olımov

Published

October 8, 2025

Overview

Welcome to the year 2912. Your data science skills are needed to solve a cosmic mystery. We have received a transmission from four light-years away, and things are not looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. Carrying almost 13,000 passengers, the ship set out on its maiden voyage to take emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

On the way to its first stop, the torrid 55 Cancri E, the ship collided near Alpha Centauri with a spacetime anomaly hidden inside a dust cloud. Sadly, it met a fate similar to that of its namesake from a thousand years earlier. Although the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help the rescue crews and bring back the lost passengers, you are asked to predict which passengers were transported by the anomaly, using records recovered from the spaceship's damaged computer system.

Save them and change history!

library(readr)
train <- read_csv("train.csv")
Rows: 8693 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
lgl (3): CryoSleep, VIP, Transported

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test <- read_csv("test.csv")
Rows: 4277 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
lgl (2): CryoSleep, VIP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

File and Data Field Descriptions

  • train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

  • PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp, where gggg indicates the group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

  • HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

  • CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

  • Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P (Port) or S (Starboard).

  • Destination - The planet the passenger will disembark on.

  • Age - The age of the passenger.

  • VIP - Whether the passenger has paid for special VIP service during the voyage.

  • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

  • Name - The first and last names of the passenger.

  • Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

  • test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data.

  • PassengerId - Id for each passenger in the test set.

  • Transported - The target. For each passenger, predict either True or False.
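The two composite fields described above (PassengerId as gggg_pp and Cabin as deck/num/side) can be split with base R alone. The following is a minimal sketch on made-up values; the vectors `ids` and `cabins` and the helper `pick` are illustrative names, not part of the data set.

```r
# Toy identifiers in the documented formats (values are made up for illustration)
ids    <- c("0001_01", "0003_01", "0003_02")
cabins <- c("B/0/P", "A/0/S", NA)

# PassengerId: gggg_pp -> travel group and number within the group
id_parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))
group    <- id_parts[, 1]
member   <- as.integer(id_parts[, 2])

# Cabin: deck/num/side; a missing cabin stays NA in every component
cabin_parts <- strsplit(cabins, "/", fixed = TRUE)
pick <- function(parts, i) if (length(parts) >= i) parts[i] else NA_character_
deck <- vapply(cabin_parts, pick, character(1), i = 1)
side <- vapply(cabin_parts, pick, character(1), i = 3)

data.frame(group, member, deck, side)
```

Passengers who share a `group` value travelled together, which is what the later group-based imputation steps rely on.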

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.1.0
✔ forcats   1.0.1     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
train %>% is.na() %>% colSums()
 PassengerId   HomePlanet    CryoSleep        Cabin  Destination          Age 
           0          201          217          199          182          179 
         VIP  RoomService    FoodCourt ShoppingMall          Spa       VRDeck 
         203          181          183          208          183          188 
        Name  Transported 
         200            0 
train2 <- train %>%
  separate(PassengerId, sep = "_", into = c("Aile", "KisiNo"), remove = FALSE) %>%
  separate(Name, sep = " ", into = c("Ad", "Soyad"), remove = FALSE)
unique(train$HomePlanet)
[1] "Europa" "Earth"  "Mars"   NA      
table(train$HomePlanet)

 Earth Europa   Mars 
  4602   2131   1759 
train2 %>% group_by(Aile) %>%
  summarise(
    uye_s = n(),
    bos_sayisi = sum(is.na(HomePlanet)),
    bosolamyan_sayisi = sum(!is.na(HomePlanet)),
    farkli_gezegen = n_distinct(HomePlanet,na.rm = TRUE)
  )
# A tibble: 6,217 × 5
   Aile  uye_s bos_sayisi bosolamyan_sayisi farkli_gezegen
   <chr> <int>      <int>             <int>          <int>
 1 0001      1          0                 1              1
 2 0002      1          0                 1              1
 3 0003      2          0                 2              1
 4 0004      1          0                 1              1
 5 0005      1          0                 1              1
 6 0006      2          0                 2              1
 7 0007      1          0                 1              1
 8 0008      3          0                 3              1
 9 0009      1          0                 1              1
10 0010      1          0                 1              1
# ℹ 6,207 more rows
incelemehp <- train2 %>% group_by(Aile) %>%
  summarise(
    uye_s = n(),
    bos_sayisi = sum(is.na(HomePlanet)),
    bosolamyan_sayisi = sum(!is.na(HomePlanet)),
    farkli_gezegen = n_distinct(HomePlanet,na.rm = TRUE)
  ) %>% ungroup()
# reframe() may return zero rows per group, so groups without any known
# HomePlanet are dropped and the result is already ungrouped
inceleme2 <- train2 %>% group_by(Aile) %>%
  reframe(
    uye_s = n(),
    bos_sayisi = sum(is.na(HomePlanet)),
    bosolamyan_sayisi = sum(!is.na(HomePlanet)),
    farkli_gezegen = n_distinct(HomePlanet,na.rm = TRUE),
    gezegen_adi = unique(HomePlanet[!is.na(HomePlanet)])
  )
library(explore)
describe_all(inceleme2)
# A tibble: 6 × 8
  variable          type     na na_pct unique   min  mean   max
  <chr>             <chr> <int>  <dbl>  <int> <dbl> <dbl> <dbl>
1 Aile              chr       0      0   6107    NA NA       NA
2 uye_s             int       0      0      8     1  1.41     8
3 bos_sayisi        int       0      0      3     0  0.01     2
4 bosolamyan_sayisi int       0      0      8     1  1.39     8
5 farkli_gezegen    int       0      0      1     1  1        1
6 gezegen_adi       chr       0      0      3    NA NA       NA
train2 <- left_join(train2,inceleme2)
Joining with `by = join_by(Aile)`

HomePlanet and Destination

library(dplyr)
library(tidyr)

# 1. Function that finds the mode (most frequent value)
get_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# 2. Prepare the data set
train2 <- train %>%
  separate(PassengerId, sep = "_", into = c("Aile", "KisiNo"), remove = FALSE) %>%
  separate(Name, sep = " ", into = c("Ad", "Soyad"), remove = FALSE)

# 3. Overall mode values
mod_homeplanet <- get_mode(train2$HomePlanet)
mod_destination <- get_mode(train2$Destination)

# 4. Fill missing values group by group (Aile = travel group)
train2 <- train2 %>%
  group_by(Aile) %>%
  mutate(
    grup_boyutu = n(),
    # If the group has a known HomePlanet, use it; otherwise use the overall mode
    HomePlanet = ifelse(
      is.na(HomePlanet),
      ifelse(
        all(is.na(HomePlanet)),
        mod_homeplanet,
        first(HomePlanet[!is.na(HomePlanet)])
      ),
      HomePlanet
    ),
    # If the group has a known Destination, use it; otherwise use the overall mode
    Destination = ifelse(
      is.na(Destination),
      ifelse(
        all(is.na(Destination)),
        mod_destination,
        first(Destination[!is.na(Destination)])
      ),
      Destination
    )
  ) %>%
  ungroup() %>%
  select(-grup_boyutu)
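The fill order used above (own value, then a value from the same travel group, then the overall mode) can be seen on a small example. This is a minimal base-R sketch on made-up data; `toy`, `grp`, `planet`, and `mode_value` are illustrative names, and it uses the group's most frequent value where the pipeline above takes the first non-missing one.

```r
# Hypothetical toy data: `grp` plays the role of Aile, `planet` of HomePlanet
toy <- data.frame(
  grp    = c("g1", "g1", "g2", "g3", "g3"),
  planet = c("Earth", NA, NA, "Mars", "Mars"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value; NA if nothing is observed
mode_value <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  names(which.max(table(x)))
}

global_mode <- mode_value(toy$planet)

# Per-group mode, repeated for every row of the group
group_mode <- ave(toy$planet, toy$grp,
                  FUN = function(x) rep(mode_value(x), length(x)))

# Fill order: own value, then the group's value, then the global mode
toy$planet_filled <- ifelse(!is.na(toy$planet), toy$planet,
                     ifelse(!is.na(group_mode), group_mode, global_mode))
toy
```

In the toy data, the missing value in `g1` is taken from the group, while the single-member group `g2` falls back to the overall mode.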
train2 %>% is.na() %>% colSums()
 PassengerId         Aile       KisiNo   HomePlanet    CryoSleep        Cabin 
           0            0            0            0          217          199 
 Destination          Age          VIP  RoomService    FoodCourt ShoppingMall 
           0          179          203          181          183          208 
         Spa       VRDeck         Name           Ad        Soyad  Transported 
         183          188          200          200          200            0 
soyisimdatasi <- train2 %>% filter(!is.na(HomePlanet)) %>% group_by(Soyad) %>% 
  summarise(n = n(),
            en_cok_gezegen = names(which.max(table(HomePlanet))),
            pay = max(table(HomePlanet))/n
            )
library(VIM)
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

Attaching package: 'VIM'
The following object is masked from 'package:datasets':

    sleep
train3 <- hotdeck(soyisimdatasi,variable = 'Soyad')
train2 <- train2 %>% separate(Cabin, into = c('deck','num','side'), sep = '/', remove = FALSE)
table(train2$deck)

   A    B    C    D    E    F    G    T 
 256  779  747  478  876 2794 2559    5 
table(train2$side)

   P    S 
4206 4288 
deckbilgisi <- train2 %>% count(deck, HomePlanet) %>% group_by(deck) %>% mutate(pay = n/sum(n))
train4 <- train
train4 <- train4 %>% separate(PassengerId, sep = "_", into = c("Aile", "KisiNo"), remove = FALSE) 
cryosleep <- train4 %>% group_by(Aile) %>%
  summarise(
    n = n(),
    bosolansayisi = sum(is.na(CryoSleep)),
    bosolmayan = sum(!is.na(CryoSleep)),
    farkli = n_distinct(CryoSleep,na.rm = T)
  )

Finding the Mode

unique(train4$HomePlanet)
[1] "Europa" "Earth"  "Mars"   NA      
table(train4$HomePlanet)

 Earth Europa   Mars 
  4602   2131   1759 
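dogru_olanlar <- sum(train$CryoSleep == TRUE,na.rm = T)
yanlis_olanlar <- sum(train$CryoSleep == FALSE,na.rm = T)

# Returns the larger of the two counts, not the modal value itself
mod_cryosleep <- function(x,y){
  ifelse(
    x>y,
    x,
    y
  )
}

# 5439 is the count of CryoSleep == FALSE, so FALSE is the mode
mod_cryosleep(dogru_olanlar,yanlis_olanlar)
[1] 5439
# The mean is NA because CryoSleep still contains missing values at this point
mean(train4$CryoSleep)
[1] NA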
train4 <- train4 %>% 
  mutate(
    CryoSleep = if_else(is.na(CryoSleep),FALSE,CryoSleep)
  )
unique(train4$CryoSleep)
[1] FALSE  TRUE
# home_planet_earth <- train4$HomePlanet == "Earth"
# home_planet_europa <- train4$HomePlanet == "Europa"
# home_planet_mars <- train4$HomePlanet == "Mars"
# home_planet_na <- train4$HomePlanet == "NA"

Home planet

# reframe() may return zero rows per group, so groups without any known
# HomePlanet are dropped and the result is already ungrouped
inceleme4 <- train4 %>% group_by(Aile) %>%
  reframe(
    uye_s = n(),
    bos_sayisi = sum(is.na(HomePlanet)),
    bosolamyan_sayisi = sum(!is.na(HomePlanet)),
    farkli_gezegen = n_distinct(HomePlanet,na.rm = TRUE),
    gezegen_adi = unique(HomePlanet[!is.na(HomePlanet)])
  )
train4 <- left_join(train4,inceleme4)
Joining with `by = join_by(Aile)`
train4 <- train4 %>% mutate(
  HomePlanet = ifelse(
    bos_sayisi != 0,
    gezegen_adi,
    HomePlanet
  )
)
table(train4$HomePlanet)

 Earth Europa   Mars 
  4634   2161   1787 
train4 <- train4 %>% mutate(
  HomePlanet = ifelse(
    is.na(HomePlanet),
    "Earth",
    HomePlanet
  )
)

Age

mean(train4$Age,na.rm = T)
[1] 28.82793
train4 <- train4 %>% mutate(
  Age = ifelse(
    is.na(Age),
    mean(train4$Age,na.rm = T),
    Age
  )
)

HomePlanet and CryoSleep together

earth_cryosleep <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T)

europe_cryosleep <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T)

mars_cryosleep <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T)

#### Step 2

earth_cryosleep_f <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F)

europe_cryosleep_f <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F)

mars_cryosleep_f <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F)

HomePlanet, CryoSleep, and Cabin

train4 <- train4 %>% separate(Cabin, into = c('deck','num','side'), sep = '/', remove = FALSE)
unique(train4$deck)
[1] "B" "F" "A" "G" NA  "E" "D" "C" "T"
# deck contains NA, so na.rm = TRUE is required; without it every sum is NA
### Earth

deck_earth1 <- sum(train4$HomePlanet == "Earth" & train4$deck == "B", na.rm = TRUE)
deck_earth2 <- sum(train4$HomePlanet == "Earth" & train4$deck == "F", na.rm = TRUE)
deck_earth3 <- sum(train4$HomePlanet == "Earth" & train4$deck == "A", na.rm = TRUE)
deck_earth4 <- sum(train4$HomePlanet == "Earth" & train4$deck == "G", na.rm = TRUE) ## this one
deck_earth5 <- sum(train4$HomePlanet == "Earth" & train4$deck == "E", na.rm = TRUE)
deck_earth6 <- sum(train4$HomePlanet == "Earth" & train4$deck == "D", na.rm = TRUE)
deck_earth7 <- sum(train4$HomePlanet == "Earth" & train4$deck == "C", na.rm = TRUE)
deck_earth8 <- sum(train4$HomePlanet == "Earth" & train4$deck == "T", na.rm = TRUE)

### Europa

deck_europ1 <- sum(train4$HomePlanet == "Europa" & train4$deck == "B", na.rm = TRUE) ## this one
deck_europ2 <- sum(train4$HomePlanet == "Europa" & train4$deck == "F", na.rm = TRUE)
deck_europ3 <- sum(train4$HomePlanet == "Europa" & train4$deck == "A", na.rm = TRUE)
deck_europ4 <- sum(train4$HomePlanet == "Europa" & train4$deck == "G", na.rm = TRUE)
deck_europ5 <- sum(train4$HomePlanet == "Europa" & train4$deck == "E", na.rm = TRUE)
deck_europ6 <- sum(train4$HomePlanet == "Europa" & train4$deck == "D", na.rm = TRUE)
deck_europ7 <- sum(train4$HomePlanet == "Europa" & train4$deck == "C", na.rm = TRUE)
deck_europ8 <- sum(train4$HomePlanet == "Europa" & train4$deck == "T", na.rm = TRUE)

### Mars

deck_mars1 <- sum(train4$HomePlanet == "Mars" & train4$deck == "B", na.rm = TRUE)
deck_mars2 <- sum(train4$HomePlanet == "Mars" & train4$deck == "F", na.rm = TRUE) ## this one
deck_mars3 <- sum(train4$HomePlanet == "Mars" & train4$deck == "A", na.rm = TRUE)
deck_mars4 <- sum(train4$HomePlanet == "Mars" & train4$deck == "G", na.rm = TRUE)
deck_mars5 <- sum(train4$HomePlanet == "Mars" & train4$deck == "E", na.rm = TRUE)
deck_mars6 <- sum(train4$HomePlanet == "Mars" & train4$deck == "D", na.rm = TRUE)
deck_mars7 <- sum(train4$HomePlanet == "Mars" & train4$deck == "C", na.rm = TRUE)
deck_mars8 <- sum(train4$HomePlanet == "Mars" & train4$deck == "T", na.rm = TRUE)
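The per-planet, per-deck counts above can also be obtained in a single cross-tabulation. The sketch below uses made-up rows in place of train4; the object names `toy`, `counts`, and `top_deck` are illustrative.

```r
# Hypothetical toy rows standing in for train4 (values are made up)
toy <- data.frame(
  HomePlanet = c("Earth", "Earth", "Europa", "Mars", "Earth", "Europa"),
  deck       = c("G", "G", "B", "F", NA, "B"),
  stringsAsFactors = FALSE
)

# One call yields every planet-by-deck count; rows with a missing deck are
# dropped by default (use useNA = "ifany" to show them)
counts <- table(toy$HomePlanet, toy$deck)
counts

# Most frequent deck for each home planet
top_deck <- colnames(counts)[apply(counts, 1, which.max)]
names(top_deck) <- rownames(counts)
top_deck
```

On the real data, `table(train4$HomePlanet, train4$deck)` would give the same numbers as the individual sums, with the most frequent deck per planet readable from each row.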

1

earth_cryosleep_cabin <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "B",na.rm = T)

europe_cryosleep_cabin <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "B",na.rm = T)

mars_cryosleep_cabin <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "B",na.rm = T)

2

earth_cryosleep_cabin1 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "A",na.rm = T)

europe_cryosleep_cabin1 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "A",na.rm = T)

mars_cryosleep_cabin1 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "A",na.rm = T)

3

earth_cryosleep_cabin2 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "F",na.rm = T)

europe_cryosleep_cabin2 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "F",na.rm = T)

mars_cryosleep_cabin2 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "F",na.rm = T)

4

earth_cryosleep_cabin3 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "G",na.rm = T)

europe_cryosleep_cabin3 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "G",na.rm = T)

mars_cryosleep_cabin3 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "G",na.rm = T)

5

earth_cryosleep_cabin4 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "E",na.rm = T)

europe_cryosleep_cabin4 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "E",na.rm = T)

mars_cryosleep_cabin4 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "E",na.rm = T)

6

earth_cryosleep_cabin5 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "D",na.rm = T)

europe_cryosleep_cabin5 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "D",na.rm = T)

mars_cryosleep_cabin5 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "D",na.rm = T)

7

earth_cryosleep_cabin6 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "C",na.rm = T)

europe_cryosleep_cabin6 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "C",na.rm = T)

mars_cryosleep_cabin6 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "C",na.rm = T)

8

earth_cryosleep_cabin7 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == T & train4$deck == "T",na.rm = T)

europe_cryosleep_cabin7 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == T & train4$deck == "T",na.rm = T)

mars_cryosleep_cabin7 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == T & train4$deck == "T",na.rm = T)

Cabin_deck

### Step 1

train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Earth" & CryoSleep == T & is.na(deck),
    "G",
    deck
  )
)


train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Europa" & CryoSleep == T & is.na(deck),
    "B",
    deck
  )
)

train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Mars" & CryoSleep == T & is.na(deck),
    "F",
    deck
  )
)




#### Step 2

train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Earth" & CryoSleep == F & is.na(deck),
    "F",
    deck
  )
)


train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Europa" & CryoSleep == F & is.na(deck),
    "C",
    deck
  )
)

train4 <- train4 %>% mutate(
  deck = ifelse(
    HomePlanet == "Mars" & CryoSleep == F & is.na(deck),
    "F",
    deck
  )
)
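The six hard-coded rules above assign the most common deck for each combination of HomePlanet and CryoSleep. A possible alternative is to compute that modal deck from the data and use it as a lookup. This is only a sketch on made-up rows; `toy`, `mode_value`, and `cell_mode` are illustrative names.

```r
# Hypothetical toy rows (values are made up)
toy <- data.frame(
  HomePlanet = c("Earth", "Earth", "Earth", "Europa", "Europa", "Europa"),
  CryoSleep  = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  deck       = c("G", "G", NA, "C", NA, "C"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value; NA if nothing is observed
mode_value <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  names(which.max(table(x)))
}

# Modal deck within each HomePlanet x CryoSleep cell, aligned to the rows
cell_mode <- ave(toy$deck, toy$HomePlanet, toy$CryoSleep,
                 FUN = function(x) rep(mode_value(x), length(x)))

# Keep observed decks, fill only the missing ones
toy$deck_filled <- ifelse(is.na(toy$deck), cell_mode, toy$deck)
toy
```

Deriving the fill values this way keeps them in sync with the data if the counts change, at the cost of being less explicit than the written-out rules.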

9

earth_cryosleep_cabin9 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "B",na.rm = T)

europe_cryosleep_cabin9 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "B",na.rm = T)

mars_cryosleep_cabin9 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "B",na.rm = T)

10

earth_cryosleep_cabin10 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "A",na.rm = T)

europe_cryosleep_cabin10 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "A",na.rm = T)

mars_cryosleep_cabin10 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "A",na.rm = T)

11

earth_cryosleep_cabin11 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "F",na.rm = T)

europe_cryosleep_cabin11 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "F",na.rm = T)

mars_cryosleep_cabin11 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "F",na.rm = T)

12

earth_cryosleep_cabin12 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "G",na.rm = T)

europe_cryosleep_cabin12 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "G",na.rm = T)

mars_cryosleep_cabin12 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "G",na.rm = T)

13

earth_cryosleep_cabin13 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "E",na.rm = T)

europe_cryosleep_cabin13 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "E",na.rm = T)

mars_cryosleep_cabin13 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "E",na.rm = T)

14

earth_cryosleep_cabin14 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "D",na.rm = T)

europe_cryosleep_cabin14 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "D",na.rm = T)

mars_cryosleep_cabin14 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "D",na.rm = T)

15

earth_cryosleep_cabin15 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "C",na.rm = T)

europe_cryosleep_cabin15 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "C",na.rm = T)

mars_cryosleep_cabin15 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "C",na.rm = T)

16

earth_cryosleep_cabin16 <- sum(train4$HomePlanet == "Earth" & train4$CryoSleep == F & train4$deck == "T",na.rm = T)

europe_cryosleep_cabin16 <- sum(train4$HomePlanet == "Europa" & train4$CryoSleep == F & train4$deck == "T",na.rm = T)

mars_cryosleep_cabin16 <- sum(train4$HomePlanet == "Mars" & train4$CryoSleep == F & train4$deck == "T",na.rm = T)

side

unique(train4$side)
[1] "P" "S" NA 
###earthside
side_earth1 <- sum(train4$HomePlanet == "Earth" & train4$side == "P",na.rm = T)
side_earth2 <- sum(train4$HomePlanet == "Earth" & train4$side == "S",na.rm = T)


###Europaside
side_europa1 <- sum(train4$HomePlanet == "Europa" & train4$side == "P",na.rm = T)
side_europa2 <- sum(train4$HomePlanet == "Europa" & train4$side == "S",na.rm = T)

###Marsside
side_mars1 <- sum(train4$HomePlanet == "Mars" & train4$side == "P",na.rm = T)
side_mars2 <- sum(train4$HomePlanet == "Mars" & train4$side == "S",na.rm = T)
train4 <- train4 %>% mutate(
  side = ifelse(HomePlanet == "Earth" & is.na(side),
                "P",
                side
                ),
  side = ifelse(HomePlanet == "Europa" & is.na(side),
                "S",
                side
                ),
  side = ifelse(HomePlanet == "Mars" & is.na(side),
                "P",
                side
                )
)

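RoomService

1) If CryoSleep == TRUE, then RoomService, FoodCourt, and the other spending columns are 0, because passengers in cryosleep are confined to their cabins.

2) A passenger has to be above a certain age to use the services: if Age <= 12, then RoomService, FoodCourt, and the other spending columns are 0.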
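Both rules can be checked against the data before they are used for imputation. The following is a minimal sketch on made-up rows with only two of the five spending columns; `toy`, `total_spend`, `zero_by_cryo`, and `max_spend_children` are illustrative names.

```r
# Hypothetical toy rows (values are made up)
toy <- data.frame(
  CryoSleep   = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  Age         = c(30, 8, 45, 10, 27),
  RoomService = c(0, 0, 120, 0, 0),
  FoodCourt   = c(0, 0, 35, 0, 410)
)

# Total spend per passenger across the spending columns
toy$total_spend <- rowSums(toy[, c("RoomService", "FoodCourt")], na.rm = TRUE)

# Rule 1: share of passengers with zero spend, split by CryoSleep
zero_by_cryo <- tapply(toy$total_spend == 0, toy$CryoSleep, mean)
zero_by_cryo

# Rule 2: largest total spend among passengers aged 12 or younger
max_spend_children <- max(toy$total_spend[toy$Age <= 12])
max_spend_children
```

If a rule holds on the real data, the zero-spend share for the corresponding group is 1 (or the maximum spend is 0), and missing spending values in that group can be set to 0 with more confidence.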

library(ggplot2)

ggplot(train4,aes(RoomService)) + 
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 181 rows containing non-finite outside the scale range
(`stat_bin()`).

train4 <- train4 %>% mutate(
  RoomService = ifelse(is.na(RoomService),0,RoomService),
  FoodCourt = ifelse(is.na(FoodCourt),0,FoodCourt),
  ShoppingMall = 
    ifelse(is.na(ShoppingMall),0,ShoppingMall),
  Spa = ifelse(is.na(Spa),0,Spa),
  VRDeck = ifelse(is.na(VRDeck),0,VRDeck)
)

VIP

ggplot(train4, aes(VIP)) +
  geom_bar()

vip_01 <- sum(train4$uye_s > 1 & train4$VIP == TRUE , na.rm = T)
vip_02 <- sum(train4$uye_s > 1 & train4$VIP == FALSE , na.rm = T)

train4 <- train4 %>% group_by(Aile) %>%          # treat members of the same travel group together
  mutate(VIP = ifelse(is.na(VIP) & uye_s > 1, 
                      first(na.omit(VIP)),  # take the first non-missing value in the group
                      VIP)
         ) 
train4 <- train4 %>%
  group_by(HomePlanet, Destination) %>%
  mutate(
    VIP = ifelse(is.na(VIP),
                 if(length(VIP[!is.na(VIP)]) > 0) sample(VIP[!is.na(VIP)], 1) else NA,
                 VIP)
  )
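Note that `sample(..., 1)` inside a grouped `mutate()` draws a single value per group, so every missing entry in that group receives the same value. If an independent draw per missing row is preferred, one possible approach is sketched below on made-up data; `toy`, `grp`, and `fill_by_draw` are illustrative names.

```r
set.seed(123)  # reproducibility

# Hypothetical toy rows (values are made up)
toy <- data.frame(
  grp = c("a", "a", "a", "a", "b", "b"),
  VIP = c(TRUE, FALSE, NA, NA, FALSE, NA)
)

# Replace each NA with its own draw from the observed values of the same group
fill_by_draw <- function(x) {
  observed <- x[!is.na(x)]
  missing  <- is.na(x)
  if (length(observed) > 0 && any(missing)) {
    idx <- sample.int(length(observed), sum(missing), replace = TRUE)
    x[missing] <- observed[idx]
  }
  x
}

toy$VIP_filled <- ave(toy$VIP, toy$grp, FUN = fill_by_draw)
toy
```

Sampling indices with `sample.int()` also avoids the special case in which `sample()` treats a single number as the range 1 to that number.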

# train4 <- train4 %>% 
#   group_by(HomePlanet, Destination, Name) %>% 
#   mutate( 
#     VIP = ifelse(is.na(VIP), sample(VIP[!is.na(VIP)], 1), VIP) 
#           )

Name

Are there repeated first and last names?

any(duplicated(train$Name))
[1] TRUE

How many repeated values are there?

sum(duplicated(train$Name))
[1] 219

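Which values are these? (NA accounts for 199 of the 219 duplicates, since Name is missing in 200 rows; the remaining 20 are real names.)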

unique(train$Name[duplicated(train$Name)])
 [1] NA                   "Elaney Webstephrey" "Sharie Gallenry"   
 [4] "Gollux Reedall"     "Ankalik Nateansive" "Grake Porki"       
 [7] "Gwendy Sykess"      "Troya Schwardson"   "Glenna Valezaley"  
[10] "Apix Wala"          "Sus Coolez"         "Dia Cartez"        
[13] "Keitha Josey"       "Cuses Pread"        "Alraium Disivering"
[16] "Juane Popelazquez"  "Loree Wolfernan"    "Carry Contrevins"  
[19] "Asch Stradick"      "Glena Hahnstonsen"  "Anton Woody"       

How many times does each of them occur?

name_cok_olanlar <- table(train$Name)
sort(name_cok_olanlar[name_cok_olanlar > 1], decreasing = T)

Alraium Disivering Ankalik Nateansive        Anton Woody          Apix Wala 
                 2                  2                  2                  2 
     Asch Stradick   Carry Contrevins        Cuses Pread         Dia Cartez 
                 2                  2                  2                  2 
Elaney Webstephrey  Glena Hahnstonsen   Glenna Valezaley     Gollux Reedall 
                 2                  2                  2                  2 
       Grake Porki      Gwendy Sykess  Juane Popelazquez       Keitha Josey 
                 2                  2                  2                  2 
   Loree Wolfernan    Sharie Gallenry         Sus Coolez   Troya Schwardson 
                 2                  2                  2                  2 
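When counting duplicates, it helps to remember that `duplicated()` treats repeated NA values as duplicates, while `table()` ignores NA by default. A small sketch with made-up names (`nm` and `counts` are illustrative):

```r
# Made-up names with three missing entries
nm <- c("Ann Lee", NA, "Bo Kim", NA, "Ann Lee", NA)

sum(duplicated(nm))                  # counts repeated NA entries too
sum(duplicated(nm[!is.na(nm)]))      # repeated real names only

# Names that occur more than once (table() ignores NA by default)
counts <- table(nm)
names(counts[counts > 1])
```

This is why the raw duplicate count for Name is much larger than the number of names that actually repeat.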

Surname

# train4 <- train4 %>% group_by(uye_s) %>%          # treat members of the same group together
#   mutate(VIP = ifelse(is.na(VIP) & uye_s > 1, 
#                       first(na.omit(VIP)),  # take the first non-missing value in the group
#                       VIP),
#          ) 

Finding the mode

# Returns the most frequent value(s) of x; ties yield more than one value
genel_mod <- function(x){
  nx <- table(x)
  mode_value <- names(nx)[nx == max(nx)]
  
  mode_value
}

Destination

nrow(train4)
[1] 8693
unique(train4$Destination)
[1] "TRAPPIST-1e"   "PSO J318.5-22" "55 Cancri e"   NA             
table(train4$Destination)

  55 Cancri e PSO J318.5-22   TRAPPIST-1e 
         1800           796          5915 
genel_mod(train4$Destination)
[1] "TRAPPIST-1e"
ggplot(train4, aes(Destination)) +
  geom_bar()

Filling Destination by probability

# set.seed(123)  # for reproducibility
# 
# # 1. Find the indices of the missing values
# na_index <- which(is.na(train4$Destination))
# 
# # 2. Take the frequencies of the observed values and compute probabilities
# freq <- table(train4$Destination)
# prob <- freq / sum(freq)
# 
# # 3. Fill the missing values at random
# train4$Destination[na_index] <- sample(names(prob), 
#                                    size = length(na_index), 
#                                    replace = TRUE, 
#                                    prob = prob)
# 
# # 4. Check the result
# 
# table(train4$Destination)
unique(train4[,c("HomePlanet","Destination")])
# A tibble: 12 × 2
# Groups:   HomePlanet, Destination [12]
   HomePlanet Destination  
   <chr>      <chr>        
 1 Europa     TRAPPIST-1e  
 2 Earth      TRAPPIST-1e  
 3 Earth      PSO J318.5-22
 4 Europa     55 Cancri e  
 5 Mars       TRAPPIST-1e  
 6 Mars       55 Cancri e  
 7 Earth      55 Cancri e  
 8 Mars       <NA>         
 9 Mars       PSO J318.5-22
10 Earth      <NA>         
11 Europa     PSO J318.5-22
12 Europa     <NA>         
train4 <- train4 %>%
  group_by(HomePlanet, VIP) %>%
  mutate(
    Destination = ifelse(is.na(Destination),
                 if(length(Destination[!is.na(Destination)]) > 0) sample(Destination[!is.na(Destination)], 1) else NA,
                 Destination)
  )

Examining relationships

library(mice)

Attaching package: 'mice'
The following object is masked from 'package:stats':

    filter
The following objects are masked from 'package:base':

    cbind, rbind
##md.pattern(train)  # Shows the missing-data pattern as a table
table(train$HomePlanet, is.na(train$CryoSleep))
        
         FALSE TRUE
  Earth   4488  114
  Europa  2073   58
  Mars    1716   43
table(train$HomePlanet, is.na(train$Destination))
        
         FALSE TRUE
  Earth   4503   99
  Europa  2094   37
  Mars    1717   42
table(train$HomePlanet, is.na(train$VIP))
        
         FALSE TRUE
  Earth   4487  115
  Europa  2089   42
  Mars    1716   43
table(train$HomePlanet, is.na(train$Cabin))
        
         FALSE TRUE
  Earth   4507   95
  Europa  2070   61
  Mars    1722   37
colSums(is.na(train))
 PassengerId   HomePlanet    CryoSleep        Cabin  Destination          Age 
           0          201          217          199          182          179 
         VIP  RoomService    FoodCourt ShoppingMall          Spa       VRDeck 
         203          181          183          208          183          188 
        Name  Transported 
         200            0 
str(train$HomePlanet)
 chr [1:8693] "Europa" "Earth" "Europa" "Europa" "Earth" "Earth" "Earth" ...
table(train$HomePlanet)

 Earth Europa   Mars 
  4602   2131   1759 
library(VIM)

aggr(train, numbers = TRUE, prop = TRUE, sortVars = TRUE)
Warning in plot.aggr(res, ...): not enough vertical space to display
frequencies (too many combinations)


 Variables sorted by number of missings: 
     Variable      Count
    CryoSleep 0.02496261
 ShoppingMall 0.02392730
          VIP 0.02335212
   HomePlanet 0.02312205
         Name 0.02300702
        Cabin 0.02289198
       VRDeck 0.02162660
    FoodCourt 0.02105142
          Spa 0.02105142
  Destination 0.02093639
  RoomService 0.02082135
          Age 0.02059128
  PassengerId 0.00000000
  Transported 0.00000000
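The proportions listed by `aggr()` are the share of missing values in each column, which can also be computed with base R. A minimal sketch on made-up data (`toy` and `miss_share` are illustrative names):

```r
# Hypothetical toy rows (values are made up)
toy <- data.frame(
  Age   = c(21, NA, 35, 40),
  VIP   = c(FALSE, NA, NA, TRUE),
  Cabin = c("B/0/P", "F/1/S", "A/2/P", "G/3/S"),
  stringsAsFactors = FALSE
)

# Share of missing values per column, largest first
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
miss_share

# Number of missing columns in each row
rowSums(is.na(toy))
```

Applied to the training data, `colMeans(is.na(train))` gives the same kind of per-column proportions (for example 217 / 8693 for CryoSleep).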
library(naniar)

Attaching package: 'naniar'
The following object is masked from 'package:explore':

    replace_na_with
gg_miss_upset(train, nsets = ncol(train))
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the UpSetR package.
  Please report the issue to the authors.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the UpSetR package.
  Please report the issue to the authors.
Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
ℹ The deprecated feature was likely used in the UpSetR package.
  Please report the issue to the authors.

##miss_case_summary(train)

Cabin

train4 <- train4 %>%
  group_by(HomePlanet) %>%
  mutate(
    Cabin = ifelse(is.na(Cabin),
                 if(length(Cabin[!is.na(Cabin)]) > 0) sample(Cabin[!is.na(Cabin)], 1) else NA,
                 Cabin)
  )

CryoSleep

How many passengers have Age < 10 and CryoSleep == TRUE? Are all passengers younger than 10 asleep, and are they travelling with their mother?

If CryoSleep == TRUE, then RoomService, FoodCourt, ShoppingMall, and the other spending columns are all 0; conversely, any spending > 0 implies CryoSleep == FALSE.

Age

For passengers with Age == 0, RoomService, FoodCourt, and Spa are 0.

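# Returns NA because Age and RoomService contain missing values;
# pass na.rm = TRUE to count only the rows known to match
sum(train$Age == 12 & train$RoomService == 0)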
[1] NA
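The NA result comes from how logical comparisons propagate missing values. A short sketch with made-up vectors (`age`, `room`, and `hit` are illustrative names):

```r
age  <- c(12, NA, 12, 30)
room <- c(0, 0, NA, 0)

# TRUE only where both conditions are known to hold; NA where either is unknown
hit <- age == 12 & room == 0
hit

sum(hit)                 # NA as soon as one element is NA
sum(hit, na.rm = TRUE)   # counts only rows known to satisfy both conditions
```

Note that `FALSE & NA` is `FALSE`, so only rows where the known condition is `TRUE` and the other is missing end up as NA.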

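Dropping Name, num, and the helper columns

# ungroup() first so that the earlier group_by() does not carry over into train5
train5 <- train4 %>% ungroup() %>% select( -Aile, -KisiNo, -Name, -num, -deck, -side, -uye_s, -gezegen_adi, -bos_sayisi, -bosolamyan_sayisi, -farkli_gezegen)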


describe_all(train5)
# A tibble: 13 × 8
   variable     type     na na_pct unique   min   mean   max
   <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
 1 PassengerId  chr       0      0   8693    NA  NA       NA
 2 HomePlanet   chr       0      0      3    NA  NA       NA
 3 CryoSleep    lgl       0      0      2     0   0.35     1
 4 Cabin        chr       0      0   6560    NA  NA       NA
 5 Destination  chr       0      0      3    NA  NA       NA
 6 Age          dbl       0      0     81     0  28.8     79
 7 VIP          lgl       0      0      2     0   0.03     1
 8 RoomService  dbl       0      0   1273     0 220.   14327
 9 FoodCourt    dbl       0      0   1507     0 448.   29813
10 ShoppingMall dbl       0      0   1115     0 170.   23492
11 Spa          dbl       0      0   1327     0 305.   22408
12 VRDeck       dbl       0      0   1306     0 298.   24133
13 Transported  lgl       0      0      2     0   0.5      1

Converting to factors

# train5$PassengerId <- as.factor(train5$PassengerId)
# train5$HomePlanet <- as.factor(train5$HomePlanet)
# train5$Destination <- as.factor(train5$Destination)
# train5$Cabin <- as.factor(train5$Cabin)

Splitting the data in two

# Train/test split
set.seed(123)  # for reproducible results

# Number of rows
n <- nrow(train5)

# Randomly select 75% of the row indices for training; the remaining 25% form the test set
train5_index <- sample(seq_len(n), size = floor(0.75 * n))

# Check the result
length(train5_index)  # about 75% of the rows
[1] 6519
train5_train <- train5[train5_index, ]
train5_test  <- train5[-train5_index, ]

nrow(train5_train)  # 75% of the rows
[1] 6519
nrow(train5_test)   # 25% of the rows
[1] 2174
train5_test_beta <- train5_test %>% select( -Transported)

describe_all(train5_train)
# A tibble: 13 × 8
   variable     type     na na_pct unique   min   mean   max
   <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
 1 PassengerId  chr       0      0   6519    NA  NA       NA
 2 HomePlanet   chr       0      0      3    NA  NA       NA
 3 CryoSleep    lgl       0      0      2     0   0.35     1
 4 Cabin        chr       0      0   5153    NA  NA       NA
 5 Destination  chr       0      0      3    NA  NA       NA
 6 Age          dbl       0      0     81     0  28.8     79
 7 VIP          lgl       0      0      2     0   0.03     1
 8 RoomService  dbl       0      0   1086     0 221.    9920
 9 FoodCourt    dbl       0      0   1239     0 458.   29813
10 ShoppingMall dbl       0      0    962     0 176.   23492
11 Spa          dbl       0      0   1104     0 311.   22408
12 VRDeck       dbl       0      0   1095     0 301.   24133
13 Transported  lgl       0      0      2     0   0.5      1

Then preserve the factor levels

# for (col in names(train5_train)) {
#   if (is.factor(train5_train[[col]])) {
#     train5_test[[col]] <- factor(train5_test[[col]], levels = levels(train5_train[[col]]))
#   }
# }
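The commented-out loop re-levels the test factors with the training levels. A small sketch of what that does to a value seen only in the test set (toy vectors, not the real columns):

```r
train_dest <- factor(c("TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"))
test_dest  <- c("55 Cancri e", "PSO J318.5-22")

# Re-level the test values with the training levels
test_dest_f <- factor(test_dest, levels = levels(train_dest))

levels(test_dest_f)   # same levels as the training factor
is.na(test_dest_f)    # the destination unseen in training becomes NA
```

Because unseen levels turn into NA under this approach, rows carrying them would be dropped or produce NA predictions, which may be why the loop is disabled in favour of pooling levels with `step_other()`.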

Building the regression models

library(recipes)

Attaching package: 'recipes'
The following object is masked from 'package:VIM':

    prepare
The following object is masked from 'package:stringr':

    fixed
The following object is masked from 'package:stats':

    step
# 1️⃣ Create the recipe; in every categorical variable, pool rare or new levels into "other"
rec <- recipe(Transported ~ . , data = train5_train) %>%
       step_other(all_nominal(), threshold = 0.01)  # levels seen in under 1% of rows, or unseen levels, become "other"

# 2️⃣ Train the recipe
rec_prep <- prep(rec, training = train5_train)

# 3️⃣ Transform the train and test sets
train_transformed <- bake(rec_prep, new_data = train5_train)
test_transformed  <- bake(rec_prep, new_data = train5_test)

# Linear model on the logical outcome (TRUE/FALSE treated as 1/0), excluding PassengerId
model1 <- lm(Transported ~ . - PassengerId, train_transformed)
summary(model1)

Call:
lm(formula = Transported ~ . - PassengerId, data = train_transformed)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.27878 -0.31057 -0.00527  0.29386  1.68061 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.743e-01  4.947e-02   7.565 4.40e-14 ***
HomePlanetEuropa          2.569e-01  1.497e-02  17.157  < 2e-16 ***
HomePlanetMars            1.102e-01  1.406e-02   7.838 5.33e-15 ***
CryoSleepTRUE             3.713e-01  1.241e-02  29.925  < 2e-16 ***
Cabinother                6.577e-02  4.778e-02   1.377   0.1687    
DestinationPSO J318.5-22 -4.102e-02  2.055e-02  -1.996   0.0459 *  
DestinationTRAPPIST-1e   -5.650e-02  1.300e-02  -4.347 1.40e-05 ***
Age                      -2.248e-03  3.668e-04  -6.128 9.40e-10 ***
VIPTRUE                  -5.653e-02  3.281e-02  -1.723   0.0850 .  
RoomService              -1.189e-04  8.363e-06 -14.222  < 2e-16 ***
FoodCourt                 4.103e-05  3.544e-06  11.576  < 2e-16 ***
ShoppingMall              7.198e-05  8.359e-06   8.612  < 2e-16 ***
Spa                      -8.225e-05  4.669e-06 -17.617  < 2e-16 ***
VRDeck                   -8.278e-05  4.851e-06 -17.065  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4092 on 6505 degrees of freedom
Multiple R-squared:  0.3317,    Adjusted R-squared:  0.3303 
F-statistic: 248.3 on 13 and 6505 DF,  p-value: < 2.2e-16
# library(parsnip)
# lm_model <- linear_reg() %>%
#     set_engine('lm') %>%
#     set_mode('regression')

# lm_fit <- lm_model %>%
#     fit(Transported ~ . - PassengerId, data = train5)
# 2️⃣ glm model (logistic regression)
# When Transported is TRUE/FALSE, family = binomial is used
# glm_model <- glm(Transported ~ . - PassengerId, data = train5, family = binomial() )
# summary(glm_model)
# library(biglm)
# 
# # To use all variables except PassengerId:
# model <- bigglm(Transported ~ - PassengerId, train5)
# 
# summary(model)
# 
# model1 <- biglm(Transported ~ - PassengerId, train5)
# 
# summary(model)
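Since `Transported` is TRUE/FALSE, the `lm()` fit is a linear probability model whose predictions are not confined to [0, 1]. The commented-out `glm()` call points at logistic regression as an alternative. A minimal sketch on simulated data follows; the `toy` data frame and its coefficients are invented for illustration, and this is not the model fitted in this analysis.

```r
set.seed(123)
n <- 200

# Simulated predictors (assumed names mirror two of the real columns)
toy <- data.frame(
  CryoSleep = rep(c(TRUE, FALSE), each = n / 2),
  Spa       = round(rexp(n, rate = 1 / 300))
)

# Simulated outcome: cryosleep raises, spa spending lowers the chance of transport
p <- plogis(-0.2 + 1.5 * toy$CryoSleep - 0.002 * toy$Spa)
toy$Transported <- runif(n) < p

# Logistic regression: binomial family with the default logit link
glm_fit <- glm(Transported ~ CryoSleep + Spa, data = toy, family = binomial())

# type = "response" returns probabilities between 0 and 1
prob <- predict(glm_fit, newdata = toy, type = "response")
pred <- prob > 0.5

mean(pred == toy$Transported)   # accuracy on the simulated data
```

On the real data the call would presumably take the form `glm(Transported ~ . - PassengerId, data = train_transformed, family = binomial())`, with `predict(..., type = "response")` applied to `test_transformed`.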

Model prediction and accuracy

# Prediction (fitted values of the linear model, used as a score for Transported)
train5_test$beta_pred <- predict(model1, newdata = test_transformed)


# TRUE/FALSE prediction with a 0.5 threshold
beta_pred_class <- train5_test$beta_pred > 0.5


# Accuracy and error rate on the held-out 25%
accuracy <- mean(beta_pred_class == train5_test$Transported)
accuracy_wrong <- mean(beta_pred_class != train5_test$Transported)

accuracy_wrong * 100
[1] 22.26311
accuracy * 100
[1] 77.73689
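Accuracy alone hides which kind of error dominates. A confusion matrix separates false positives from false negatives; the sketch below uses made-up vectors, with `actual` and `predicted` as illustrative stand-ins for `train5_test$Transported` and `beta_pred_class`.

```r
actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# Fixing the levels keeps the table 2 x 2 even if one class is never predicted
conf_mat <- table(
  predicted = factor(predicted, levels = c(FALSE, TRUE)),
  actual    = factor(actual,    levels = c(FALSE, TRUE))
)
conf_mat

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

The off-diagonal cells show whether the model more often predicts a transport that did not happen or misses one that did.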
describe_all(train5)
# A tibble: 13 × 8
   variable     type     na na_pct unique   min   mean   max
   <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
 1 PassengerId  chr       0      0   8693    NA  NA       NA
 2 HomePlanet   chr       0      0      3    NA  NA       NA
 3 CryoSleep    lgl       0      0      2     0   0.35     1
 4 Cabin        chr       0      0   6560    NA  NA       NA
 5 Destination  chr       0      0      3    NA  NA       NA
 6 Age          dbl       0      0     81     0  28.8     79
 7 VIP          lgl       0      0      2     0   0.03     1
 8 RoomService  dbl       0      0   1273     0 220.   14327
 9 FoodCourt    dbl       0      0   1507     0 448.   29813
10 ShoppingMall dbl       0      0   1115     0 170.   23492
11 Spa          dbl       0      0   1327     0 305.   22408
12 VRDeck       dbl       0      0   1306     0 298.   24133
13 Transported  lgl       0      0      2     0   0.5      1