Titanic projesi

Veri seti açıklayın

veri dosayasi.

Veri seti, belirli bir amaca yönelik olarak toplanmış ve organize edilmiş verilerin bir koleksiyonudur. Genellikle bilgisayar algoritmalarını eğitmek, test etmek veya analiz etmek için kullanılırlar. Veri setleri, çeşitli kaynaklardan gelir ve genellikle tablo, metin dosyası veya veritabanı gibi yapılandırılmış formatlarda depolanır.

Veri setleri, makine öğrenimi ve yapay zeka gibi alanlarda çok önemlidir çünkü bu teknolojilerin etkili bir şekilde çalışabilmesi için büyük miktarda veriye ihtiyaçları vardır. Veri setleri, özellikle belirli bir görev veya probleme odaklanan veri bilimcileri ve araştırmacılar için değerli bir araçtır.

library(readr)
test <- read_csv("test.csv")
## Rows: 4277 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
## dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
## lgl (2): CryoSleep, VIP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train <- read_csv("train.csv")
## Rows: 8693 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): PassengerId, HomePlanet, Cabin, Destination, Name
## dbl (6): Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
## lgl (3): CryoSleep, VIP, Transported
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(rmarkdown)
paged_table(test)
paged_table(train)

PassengerId - Her yolcu için benzersiz bir Kimlik. Her Kimlik gggg_pp biçimini alır; burada gggg yolcunun birlikte seyahat ettiği grubu belirtir ve pp grup içindeki numaradır. Bir gruptaki insanlar çoğunlukla aile üyeleridir, ancak her zaman değil.

HomePlanet - Yolcunun ayrıldığı gezegen, genellikle daimi ikamet ettikleri gezegen.

CryoSleep - Yolcunun yolculuk süresince askıya alınmış animasyona alınmayı seçip seçmediğini belirtir. Dondurucu uykudaki yolcular kabinlerine hapsedilir.

Cabin - Yolcunun kaldığı kabin numarası. Güverte/numara/yan formunu alır; burada taraf, İskele için P veya Sancak için S olabilir.

Destination : Yolcunun ineceÄŸi gezegen.

Age* - Yolcunun yaşı.

VIP - Yolcunun yolculuk sırasında özel VIP hizmeti için ödeme yapıp yapmadığı.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Yolcunun, Uzay Gemisi Titanic’in birçok lüks olanağının her birinde fatura ettiği tutar.

Name - Yolcunun adı ve soyadı.

Transported - Yolcunun başka bir boyuta taşınıp taşınmadığı. Bu hedeftir, tahmin etmeye çalıştığınız sütundur.

test.csv - Yolcuların geri kalan üçte birinin (~4300) kişisel kayıtları, test verisi olarak kullanılacak. Göreviniz bu setteki yolcular için Taşınan değerini tahmin etmektir.

sample_submission.csv - Doğru formatta bir gönderim dosyası. PassengerId - Test setindeki her yolcunun kimliği. Taşınan - Hedef. Her yolcu için Doğru veya Yanlış’ı tahmin edin.

PassengerId - Test setindeki her yolcunun kimliÄŸi.

Transported - Hedef. Her yolcu için Doğru veya Yanlış’ı tahmin edin.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(explore)
test %>% describe_all()
## # A tibble: 13 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  chr       0    0     4277    NA  NA       NA
##  2 HomePlanet   chr      87    2        4    NA  NA       NA
##  3 CryoSleep    lgl      93    2.2      3     0   0.37     1
##  4 Cabin        chr     100    2.3   3266    NA  NA       NA
##  5 Destination  chr      92    2.2      4    NA  NA       NA
##  6 Age          dbl      91    2.1     80     0  28.7     79
##  7 VIP          lgl      93    2.2      3     0   0.02     1
##  8 RoomService  dbl      82    1.9    843     0 219.   11567
##  9 FoodCourt    dbl     106    2.5    903     0 439.   25273
## 10 ShoppingMall dbl      98    2.3    716     0 177.    8292
## 11 Spa          dbl     101    2.4    834     0 303.   19844
## 12 VRDeck       dbl      80    1.9    797     0 311.   22272
## 13 Name         chr      94    2.2   4177    NA  NA       NA
train %>% describe_all()
## # A tibble: 14 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  chr       0    0     8693    NA  NA       NA
##  2 HomePlanet   chr     201    2.3      4    NA  NA       NA
##  3 CryoSleep    lgl     217    2.5      3     0   0.36     1
##  4 Cabin        chr     199    2.3   6561    NA  NA       NA
##  5 Destination  chr     182    2.1      4    NA  NA       NA
##  6 Age          dbl     179    2.1     81     0  28.8     79
##  7 VIP          lgl     203    2.3      3     0   0.02     1
##  8 RoomService  dbl     181    2.1   1274     0 225.   14327
##  9 FoodCourt    dbl     183    2.1   1508     0 458.   29813
## 10 ShoppingMall dbl     208    2.4   1116     0 174.   23492
## 11 Spa          dbl     183    2.1   1328     0 311.   22408
## 12 VRDeck       dbl     188    2.2   1307     0 305.   24133
## 13 Name         chr     200    2.3   8474    NA  NA       NA
## 14 Transported  lgl       0    0        2     0   0.5      1
test[c('group', 'pp')] <- str_split_fixed(test$PassengerId,'-', 2)
train[c('group', 'pp')] <- str_split_fixed(train$PassengerId,'-', 2)
head(train[,c("PassengerId", "group", "pp")])
## # A tibble: 6 × 3
##   PassengerId group   pp   
##   <chr>       <chr>   <chr>
## 1 0001_01     0001_01 ""   
## 2 0002_01     0002_01 ""   
## 3 0003_01     0003_01 ""   
## 4 0003_02     0003_02 ""   
## 5 0004_01     0004_01 ""   
## 6 0005_01     0005_01 ""
test$withgroup <- ifelse(duplicated(test$group) | duplicated(test$group, fromLast = TRUE), 1,0)
train$withgroup <- ifelse(duplicated(train$group) | duplicated(train$group, fromLast = TRUE), 1,0)
head(test[,c("PassengerId", "group", "pp", "withgroup")])
## # A tibble: 6 × 4
##   PassengerId group   pp    withgroup
##   <chr>       <chr>   <chr>     <dbl>
## 1 0013_01     0013_01 ""            0
## 2 0018_01     0018_01 ""            0
## 3 0019_01     0019_01 ""            0
## 4 0021_01     0021_01 ""            0
## 5 0023_01     0023_01 ""            0
## 6 0027_01     0027_01 ""            0
head(train[,c("PassengerId", "group", "pp", "withgroup")])
## # A tibble: 6 × 4
##   PassengerId group   pp    withgroup
##   <chr>       <chr>   <chr>     <dbl>
## 1 0001_01     0001_01 ""            0
## 2 0002_01     0002_01 ""            0
## 3 0003_01     0003_01 ""            0
## 4 0003_02     0003_02 ""            0
## 5 0004_01     0004_01 ""            0
## 6 0005_01     0005_01 ""            0
train[c('deck', 'num', 'side')] <- str_split_fixed(train$Cabin, '/', 3)
test[c('deck', 'num', 'side')] <- str_split_fixed(test$Cabin, '/', 3)
test <-test %>% mutate_if(is.character,as.factor)
train <-train %>% mutate_if(is.character,as.factor)
summary(train)
##   PassengerId    HomePlanet   CryoSleep           Cabin     
##  0001_01:   1   Earth :4602   Mode :logical   G/734/S:   8  
##  0002_01:   1   Europa:2131   FALSE:5439      B/11/S :   7  
##  0003_01:   1   Mars  :1759   TRUE :3037      B/201/P:   7  
##  0003_02:   1   NA's  : 201   NA's :217       B/82/S :   7  
##  0004_01:   1                                 C/137/S:   7  
##  0005_01:   1                                 (Other):8458  
##  (Other):8687                                 NA's   : 199  
##         Destination        Age           VIP           RoomService     
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0.0  
##  PSO J318.5-22: 796   1st Qu.:19.00   FALSE:8291      1st Qu.:    0.0  
##  TRAPPIST-1e  :5915   Median :27.00   TRUE :199       Median :    0.0  
##  NA's         : 182   Mean   :28.83   NA's :203       Mean   :  224.7  
##                       3rd Qu.:38.00                   3rd Qu.:   47.0  
##                       Max.   :79.00                   Max.   :14327.0  
##                       NA's   :179                     NA's   :181      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  458.1   Mean   :  173.7   Mean   :  311.1   Mean   :  304.9  
##  3rd Qu.:   76.0   3rd Qu.:   27.0   3rd Qu.:   59.0   3rd Qu.:   46.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##  NA's   :183       NA's   :208       NA's   :183       NA's   :188      
##                  Name      Transported         group      pp        withgroup
##  Alraium Disivering:   2   Mode :logical   0001_01:   1   :8693   Min.   :0  
##  Ankalik Nateansive:   2   FALSE:4315      0002_01:   1           1st Qu.:0  
##  Anton Woody       :   2   TRUE :4378      0003_01:   1           Median :0  
##  Apix Wala         :   2                   0003_02:   1           Mean   :0  
##  Asch Stradick     :   2                   0004_01:   1           3rd Qu.:0  
##  (Other)           :8483                   0005_01:   1           Max.   :0  
##  NA's              : 200                   (Other):8687                      
##       deck           num       side    
##  F      :2794          : 199    : 199  
##  G      :2559   82     :  28   P:4206  
##  E      : 876   19     :  22   S:4288  
##  B      : 779   86     :  22           
##  C      : 747   176    :  21           
##  (Other): 739   56     :  21           
##  NA's   : 199   (Other):8380
test [test == ' '] <- NA
train [train == ' '] <- NA
train$num <- droplevels(train$num)
test$num <- droplevels(test$num)
train$side <- droplevels(train$side)
test$side <- droplevels(test$side)
summary(train)
##   PassengerId    HomePlanet   CryoSleep           Cabin     
##  0001_01:   1   Earth :4602   Mode :logical   G/734/S:   8  
##  0002_01:   1   Europa:2131   FALSE:5439      B/11/S :   7  
##  0003_01:   1   Mars  :1759   TRUE :3037      B/201/P:   7  
##  0003_02:   1   NA's  : 201   NA's :217       B/82/S :   7  
##  0004_01:   1                                 C/137/S:   7  
##  0005_01:   1                                 (Other):8458  
##  (Other):8687                                 NA's   : 199  
##         Destination        Age           VIP           RoomService     
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0.0  
##  PSO J318.5-22: 796   1st Qu.:19.00   FALSE:8291      1st Qu.:    0.0  
##  TRAPPIST-1e  :5915   Median :27.00   TRUE :199       Median :    0.0  
##  NA's         : 182   Mean   :28.83   NA's :203       Mean   :  224.7  
##                       3rd Qu.:38.00                   3rd Qu.:   47.0  
##                       Max.   :79.00                   Max.   :14327.0  
##                       NA's   :179                     NA's   :181      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  458.1   Mean   :  173.7   Mean   :  311.1   Mean   :  304.9  
##  3rd Qu.:   76.0   3rd Qu.:   27.0   3rd Qu.:   59.0   3rd Qu.:   46.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##  NA's   :183       NA's   :208       NA's   :183       NA's   :188      
##                  Name      Transported         group      pp        withgroup
##  Alraium Disivering:   2   Mode :logical   0001_01:   1   :8693   Min.   :0  
##  Ankalik Nateansive:   2   FALSE:4315      0002_01:   1           1st Qu.:0  
##  Anton Woody       :   2   TRUE :4378      0003_01:   1           Median :0  
##  Apix Wala         :   2                   0003_02:   1           Mean   :0  
##  Asch Stradick     :   2                   0004_01:   1           3rd Qu.:0  
##  (Other)           :8483                   0005_01:   1           Max.   :0  
##  NA's              : 200                   (Other):8687                      
##       deck           num       side    
##  F      :2794          : 199    : 199  
##  G      :2559   82     :  28   P:4206  
##  E      : 876   19     :  22   S:4288  
##  B      : 779   86     :  22           
##  C      : 747   176    :  21           
##  (Other): 739   56     :  21           
##  NA's   : 199   (Other):8380
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
test <- test %>% 
  group_by(group) %>% 
  mutate(HomePlanet = na.locf(HomePlanet, na.rm = FALSE))
train <- train %>% 
  group_by(group) %>% 
  mutate(HomePlanet = na.locf(HomePlanet, na.rm = FALSE))
most_frequent_hp <- train %>%
 filter(!is.na(HomePlanet)) %>%
 group_by(Destination, HomePlanet) %>%
 summarize(count = n()) %>%
 arrange(Destination, desc(count)) %>%
 slice(1) %>%
 ungroup()
## `summarise()` has grouped output by 'Destination'. You can override using the
## `.groups` argument.
most_frequent_hp
## # A tibble: 4 × 3
##   Destination   HomePlanet count
##   <fct>         <fct>      <int>
## 1 55 Cancri e   Europa       886
## 2 PSO J318.5-22 Earth        712
## 3 TRAPPIST-1e   Earth       3101
## 4 <NA>          Earth         99
train$HomePlanet <- as.character(train$HomePlanet)
train$Destination <- as.character(train$Destination)
train <- train %>%
  mutate(HomePlanet= ifelse(is.na(HomePlanet) & Destination == "55 Cancri e","Europa",
                            ifelse(is.na(HomePlanet), "Earth", HomePlanet)))
test$HomePlanet <- as.character(test$HomePlanet)
test$Destination <- as.character(test$Destination)
test <- test %>%
  mutate(HomePlanet= ifelse(is.na(HomePlanet) & Destination == "55 Cancri e","Europa",
                            ifelse(is.na(HomePlanet), "Earth", HomePlanet)))
train %>% describe_all()
## # A tibble: 20 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  fct       0    0     8693    NA  NA       NA
##  2 HomePlanet   chr       4    0        4    NA  NA       NA
##  3 CryoSleep    lgl     217    2.5      3     0   0.36     1
##  4 Cabin        fct     199    2.3   6561    NA  NA       NA
##  5 Destination  chr     182    2.1      4    NA  NA       NA
##  6 Age          dbl     179    2.1     81     0  28.8     79
##  7 VIP          lgl     203    2.3      3     0   0.02     1
##  8 RoomService  dbl     181    2.1   1274     0 225.   14327
##  9 FoodCourt    dbl     183    2.1   1508     0 458.   29813
## 10 ShoppingMall dbl     208    2.4   1116     0 174.   23492
## 11 Spa          dbl     183    2.1   1328     0 311.   22408
## 12 VRDeck       dbl     188    2.2   1307     0 305.   24133
## 13 Name         fct     200    2.3   8474    NA  NA       NA
## 14 Transported  lgl       0    0        2     0   0.5      1
## 15 group        fct       0    0     8693    NA  NA       NA
## 16 pp           fct       0    0        1    NA  NA       NA
## 17 withgroup    dbl       0    0        1     0   0        0
## 18 deck         fct     199    2.3      9    NA  NA       NA
## 19 num          fct       0    0     1818    NA  NA       NA
## 20 side         fct       0    0        3    NA  NA       NA
test %>% describe_all()
## # A tibble: 19 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  fct       0    0     4277    NA  NA       NA
##  2 HomePlanet   chr       2    0        4    NA  NA       NA
##  3 CryoSleep    lgl      93    2.2      3     0   0.37     1
##  4 Cabin        fct     100    2.3   3266    NA  NA       NA
##  5 Destination  chr      92    2.2      4    NA  NA       NA
##  6 Age          dbl      91    2.1     80     0  28.7     79
##  7 VIP          lgl      93    2.2      3     0   0.02     1
##  8 RoomService  dbl      82    1.9    843     0 219.   11567
##  9 FoodCourt    dbl     106    2.5    903     0 439.   25273
## 10 ShoppingMall dbl      98    2.3    716     0 177.    8292
## 11 Spa          dbl     101    2.4    834     0 303.   19844
## 12 VRDeck       dbl      80    1.9    797     0 311.   22272
## 13 Name         fct      94    2.2   4177    NA  NA       NA
## 14 group        fct       0    0     4277    NA  NA       NA
## 15 pp           fct       0    0        1    NA  NA       NA
## 16 withgroup    dbl       0    0        1     0   0        0
## 17 deck         fct     100    2.3      9    NA  NA       NA
## 18 num          fct       0    0     1506    NA  NA       NA
## 19 side         fct       0    0        3    NA  NA       NA
train <- transform(train, HomePlanet = replace(HomePlanet, is.na(HomePlanet),
"Earth"))
train <- transform(train, HomePlanet = replace(HomePlanet, is.na(HomePlanet), "Earth"))
most_frequent_destinations <- train %>%
 filter(!is.na(Destination)) %>%
 group_by(HomePlanet, Destination ) %>%
 summarize(count = n()) %>%
 arrange(HomePlanet, desc(count)) %>%
 slice(1) %>%
 ungroup()
## `summarise()` has grouped output by 'HomePlanet'. You can override using the
## `.groups` argument.
most_frequent_destinations
## # A tibble: 3 × 3
##   HomePlanet Destination count
##   <chr>      <chr>       <int>
## 1 Earth      TRAPPIST-1e  3251
## 2 Europa     TRAPPIST-1e  1189
## 3 Mars       TRAPPIST-1e  1475
test <- transform(test, Destination = replace(Destination, is.na(Destination), "TRAPPIST-1e"))
train <- transform(train, Destination = replace(Destination, is.na(Destination), "TRAPPIST-1e"))
test$HomePlanet <- as.factor(test$HomePlanet)
test$Destination <- as.factor(test$Destination)
train$HomePlanet <- as.factor(train$HomePlanet)
train$Destination <- as.factor(train$Destination)
summary(train)
##   PassengerId    HomePlanet   CryoSleep           Cabin     
##  0001_01:   1   Earth :4772   Mode :logical   G/734/S:   8  
##  0002_01:   1   Europa:2162   FALSE:5439      B/11/S :   7  
##  0003_01:   1   Mars  :1759   TRUE :3037      B/201/P:   7  
##  0003_02:   1                 NA's :217       B/82/S :   7  
##  0004_01:   1                                 C/137/S:   7  
##  0005_01:   1                                 (Other):8458  
##  (Other):8687                                 NA's   : 199  
##         Destination        Age           VIP           RoomService     
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0.0  
##  PSO J318.5-22: 796   1st Qu.:19.00   FALSE:8291      1st Qu.:    0.0  
##  TRAPPIST-1e  :6097   Median :27.00   TRUE :199       Median :    0.0  
##                       Mean   :28.83   NA's :203       Mean   :  224.7  
##                       3rd Qu.:38.00                   3rd Qu.:   47.0  
##                       Max.   :79.00                   Max.   :14327.0  
##                       NA's   :179                     NA's   :181      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  458.1   Mean   :  173.7   Mean   :  311.1   Mean   :  304.9  
##  3rd Qu.:   76.0   3rd Qu.:   27.0   3rd Qu.:   59.0   3rd Qu.:   46.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##  NA's   :183       NA's   :208       NA's   :183       NA's   :188      
##                  Name      Transported         group      pp        withgroup
##  Alraium Disivering:   2   Mode :logical   0001_01:   1   :8693   Min.   :0  
##  Ankalik Nateansive:   2   FALSE:4315      0002_01:   1           1st Qu.:0  
##  Anton Woody       :   2   TRUE :4378      0003_01:   1           Median :0  
##  Apix Wala         :   2                   0003_02:   1           Mean   :0  
##  Asch Stradick     :   2                   0004_01:   1           3rd Qu.:0  
##  (Other)           :8483                   0005_01:   1           Max.   :0  
##  NA's              : 200                   (Other):8687                      
##       deck           num       side    
##  F      :2794          : 199    : 199  
##  G      :2559   82     :  28   P:4206  
##  E      : 876   19     :  22   S:4288  
##  B      : 779   86     :  22           
##  C      : 747   176    :  21           
##  (Other): 739   56     :  21           
##  NA's   : 199   (Other):8380
train <- train %>%
mutate(RoomService = coalesce(RoomService, 0),
 FoodCourt = coalesce(FoodCourt, 0),
ShoppingMall = coalesce(ShoppingMall, 0),
Spa = coalesce(Spa, 0),
VRDeck = coalesce(VRDeck, 0))
test <- test %>%
mutate(RoomService = coalesce(RoomService, 0),
 FoodCourt = coalesce(FoodCourt, 0),
ShoppingMall = coalesce(ShoppingMall, 0),
Spa = coalesce(Spa, 0),
VRDeck = coalesce(VRDeck, 0))
test <- test %>%
group_by(HomePlanet, Destination) %>%
mutate_at(vars(Age), ~replace_na(., mean(., na.rm = TRUE)))
train <- train %>%
group_by(HomePlanet, Destination) %>%
mutate_at(vars(Age), ~replace_na(., mean(., na.rm = TRUE)))
train$expense <- train$RoomService + train$FoodCourt + train$ShoppingMall + train$Spa + train$VRDeck
test$expense <- test$RoomService + test$FoodCourt + test$ShoppingMall + test$Spa + test$VRDeck
train <- transform(train, CryoSleep = replace(CryoSleep, expense>0 & Age>12, "FALSE"))
test <- transform(test, CryoSleep = replace(CryoSleep, expense>0 & Age>12, "FALSE"))
describe_all(train)
## # A tibble: 21 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  fct       0    0     8693    NA  NA       NA
##  2 HomePlanet   fct       0    0        3    NA  NA       NA
##  3 CryoSleep    chr      98    1.1      3    NA  NA       NA
##  4 Cabin        fct     199    2.3   6561    NA  NA       NA
##  5 Destination  fct       0    0        3    NA  NA       NA
##  6 Age          dbl       0    0       88     0  28.8     79
##  7 VIP          lgl     203    2.3      3     0   0.02     1
##  8 RoomService  dbl       0    0     1273     0 220.   14327
##  9 FoodCourt    dbl       0    0     1507     0 448.   29813
## 10 ShoppingMall dbl       0    0     1115     0 170.   23492
## # ℹ 11 more rows
train$cryosleep <- as.factor(train$CryoSleep)
test$cryosleep <- as.factor(test$CryoSleep)
summary(train)
##   PassengerId    HomePlanet    CryoSleep             Cabin     
##  0001_01:   1   Earth :4772   Length:8693        G/734/S:   8  
##  0002_01:   1   Europa:2162   Class :character   B/11/S :   7  
##  0003_01:   1   Mars  :1759   Mode  :character   B/201/P:   7  
##  0003_02:   1                                    B/82/S :   7  
##  0004_01:   1                                    C/137/S:   7  
##  0005_01:   1                                    (Other):8458  
##  (Other):8687                                    NA's   : 199  
##         Destination        Age           VIP           RoomService   
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0  
##  PSO J318.5-22: 796   1st Qu.:20.00   FALSE:8291      1st Qu.:    0  
##  TRAPPIST-1e  :6097   Median :27.00   TRUE :199       Median :    0  
##                       Mean   :28.83   NA's :203       Mean   :  220  
##                       3rd Qu.:37.00                   3rd Qu.:   41  
##                       Max.   :79.00                   Max.   :14327  
##                                                                      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  448.4   Mean   :  169.6   Mean   :  304.6   Mean   :  298.3  
##  3rd Qu.:   61.0   3rd Qu.:   22.0   3rd Qu.:   53.0   3rd Qu.:   40.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##                                                                         
##                  Name      Transported         group      pp        withgroup
##  Alraium Disivering:   2   Mode :logical   0001_01:   1   :8693   Min.   :0  
##  Ankalik Nateansive:   2   FALSE:4315      0002_01:   1           1st Qu.:0  
##  Anton Woody       :   2   TRUE :4378      0003_01:   1           Median :0  
##  Apix Wala         :   2                   0003_02:   1           Mean   :0  
##  Asch Stradick     :   2                   0004_01:   1           3rd Qu.:0  
##  (Other)           :8483                   0005_01:   1           Max.   :0  
##  NA's              : 200                   (Other):8687                      
##       deck           num       side        expense      cryosleep   
##  F      :2794          : 199    : 199   Min.   :    0   FALSE:5558  
##  G      :2559   82     :  28   P:4206   1st Qu.:    0   TRUE :3037  
##  E      : 876   19     :  22   S:4288   Median :  716   NA's :  98  
##  B      : 779   86     :  22            Mean   : 1441               
##  C      : 747   176    :  21            3rd Qu.: 1441               
##  (Other): 739   56     :  21            Max.   :35987               
##  NA's   : 199   (Other):8380
train <- transform(train, VIP = replace(VIP, is.na(VIP), FALSE))
test <- transform(train, VIP = replace(VIP, is.na(VIP), FALSE))
train %>% describe_all()
## # A tibble: 22 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  fct       0    0     8693    NA  NA       NA
##  2 HomePlanet   fct       0    0        3    NA  NA       NA
##  3 CryoSleep    chr      98    1.1      3    NA  NA       NA
##  4 Cabin        fct     199    2.3   6561    NA  NA       NA
##  5 Destination  fct       0    0        3    NA  NA       NA
##  6 Age          dbl       0    0       88     0  28.8     79
##  7 VIP          lgl       0    0        2     0   0.02     1
##  8 RoomService  dbl       0    0     1273     0 220.   14327
##  9 FoodCourt    dbl       0    0     1507     0 448.   29813
## 10 ShoppingMall dbl       0    0     1115     0 170.   23492
## # ℹ 12 more rows
train <- train %>%
  group_by(group) %>%
mutate(deck = na.locf(deck, na.rm = FALSE))
test <- test %>%
  group_by(group) %>%
mutate(deck = na.locf(deck, na.rm = FALSE))
summary(train)
##   PassengerId    HomePlanet    CryoSleep             Cabin     
##  0001_01:   1   Earth :4772   Length:8693        G/734/S:   8  
##  0002_01:   1   Europa:2162   Class :character   B/11/S :   7  
##  0003_01:   1   Mars  :1759   Mode  :character   B/201/P:   7  
##  0003_02:   1                                    B/82/S :   7  
##  0004_01:   1                                    C/137/S:   7  
##  0005_01:   1                                    (Other):8458  
##  (Other):8687                                    NA's   : 199  
##         Destination        Age           VIP           RoomService   
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0  
##  PSO J318.5-22: 796   1st Qu.:20.00   FALSE:8494      1st Qu.:    0  
##  TRAPPIST-1e  :6097   Median :27.00   TRUE :199       Median :    0  
##                       Mean   :28.83                   Mean   :  220  
##                       3rd Qu.:37.00                   3rd Qu.:   41  
##                       Max.   :79.00                   Max.   :14327  
##                                                                      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  448.4   Mean   :  169.6   Mean   :  304.6   Mean   :  298.3  
##  3rd Qu.:   61.0   3rd Qu.:   22.0   3rd Qu.:   53.0   3rd Qu.:   40.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##                                                                         
##                  Name      Transported         group      pp        withgroup
##  Alraium Disivering:   2   Mode :logical   0001_01:   1   :8693   Min.   :0  
##  Ankalik Nateansive:   2   FALSE:4315      0002_01:   1           1st Qu.:0  
##  Anton Woody       :   2   TRUE :4378      0003_01:   1           Median :0  
##  Apix Wala         :   2                   0003_02:   1           Mean   :0  
##  Asch Stradick     :   2                   0004_01:   1           3rd Qu.:0  
##  (Other)           :8483                   0005_01:   1           Max.   :0  
##  NA's              : 200                   (Other):8687                      
##       deck           num       side        expense      cryosleep   
##  F      :2794          : 199    : 199   Min.   :    0   FALSE:5558  
##  G      :2559   82     :  28   P:4206   1st Qu.:    0   TRUE :3037  
##  E      : 876   19     :  22   S:4288   Median :  716   NA's :  98  
##  B      : 779   86     :  22            Mean   : 1441               
##  C      : 747   176    :  21            3rd Qu.: 1441               
##  (Other): 739   56     :  21            Max.   :35987               
##  NA's   : 199   (Other):8380
most_frequent_deck <- train %>%
filter(!is.na(deck)) %>%
group_by (HomePlanet, deck) %>%
summarize(count = n()) %>%
arrange(HomePlanet, desc(count)) %>%
slice (1) %>%
ungroup()
## `summarise()` has grouped output by 'HomePlanet'. You can override using the
## `.groups` argument.
most_frequent_deck
## # A tibble: 3 × 3
##   HomePlanet deck  count
##   <fct>      <fct> <int>
## 1 Earth      G      2553
## 2 Europa     B       771
## 3 Mars       F      1110
train$HomePlanet <- as.character(train$HomePlanet)
train$deck <- as.character(train$deck)
train <- train %>% 
   mutate(deck = ifelse(is.na(deck) & HomePlanet ==  "Earth", "G",
                       ifelse(is.na(deck) & HomePlanet == "Europa", "B",
                              ifelse(is.na(deck) & HomePlanet == "Mars", "F", deck))))
test$HomePlanet <- as.character(test$HomePlanet)
test$deck <- as.character(test$deck)
test <- test %>% 
   mutate(deck = ifelse(is.na(deck) & HomePlanet ==  "Earth", "G",
                       ifelse(is.na(deck) & HomePlanet == "Europa", "B",
                              ifelse(is.na(deck) & HomePlanet == "Mars", "F", deck))))
most_frequent_side <- train %>%
filter(!is.na(deck)) %>%
group_by (HomePlanet, deck) %>%
summarize(count = n()) %>%
arrange(HomePlanet, desc(count)) %>%
slice (1) %>%
ungroup()
## `summarise()` has grouped output by 'HomePlanet'. You can override using the
## `.groups` argument.
most_frequent_side
## # A tibble: 3 × 3
##   HomePlanet deck  count
##   <chr>      <chr> <int>
## 1 Earth      G      2652
## 2 Europa     B       834
## 3 Mars       F      1147
train$side <- as.character(train$side)
train <- train %>% 
   mutate(deck = ifelse(is.na(side) & HomePlanet ==  "Earth", "G",
                       ifelse(is.na(side) & HomePlanet == "Europa", "B",
                              ifelse(is.na(side) & HomePlanet == "Mars", "F", deck))))
test$side <- as.character(test$side)
test <- test %>% 
   mutate(deck = ifelse(is.na(side) & HomePlanet ==  "Earth", "G",
                       ifelse(is.na(side) & HomePlanet == "Europa", "B",
                              ifelse(is.na(side) & HomePlanet == "Mars", "F", deck))))
train %>% describe_all()
## # A tibble: 22 × 8
##    variable     type     na na_pct unique   min   mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>  <dbl> <dbl>
##  1 PassengerId  fct       0    0     8693    NA  NA       NA
##  2 HomePlanet   chr       0    0        3    NA  NA       NA
##  3 CryoSleep    chr      98    1.1      3    NA  NA       NA
##  4 Cabin        fct     199    2.3   6561    NA  NA       NA
##  5 Destination  fct       0    0        3    NA  NA       NA
##  6 Age          dbl       0    0       88     0  28.8     79
##  7 VIP          lgl       0    0        2     0   0.02     1
##  8 RoomService  dbl       0    0     1273     0 220.   14327
##  9 FoodCourt    dbl       0    0     1507     0 448.   29813
## 10 ShoppingMall dbl       0    0     1115     0 170.   23492
## # ℹ 12 more rows
train <- train %>% select(-c("Cabin", "Name", "group", "pp", "num"))
## Adding missing grouping variables: `group`
test <- test %>% select(-c("Cabin", "Name", "group", "pp", "num"))
## Adding missing grouping variables: `group`
describe_all(train)
## # A tibble: 18 × 8
##    variable     type     na na_pct unique   min    mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>   <dbl> <dbl>
##  1 group        fct       0    0     8693    NA   NA       NA
##  2 PassengerId  fct       0    0     8693    NA   NA       NA
##  3 HomePlanet   chr       0    0        3    NA   NA       NA
##  4 CryoSleep    chr      98    1.1      3    NA   NA       NA
##  5 Destination  fct       0    0        3    NA   NA       NA
##  6 Age          dbl       0    0       88     0   28.8     79
##  7 VIP          lgl       0    0        2     0    0.02     1
##  8 RoomService  dbl       0    0     1273     0  220.   14327
##  9 FoodCourt    dbl       0    0     1507     0  448.   29813
## 10 ShoppingMall dbl       0    0     1115     0  170.   23492
## 11 Spa          dbl       0    0     1327     0  305.   22408
## 12 VRDeck       dbl       0    0     1306     0  298.   24133
## 13 Transported  lgl       0    0        2     0    0.5      1
## 14 withgroup    dbl       0    0        1     0    0        0
## 15 deck         chr       0    0        8    NA   NA       NA
## 16 side         chr       0    0        3    NA   NA       NA
## 17 expense      dbl       0    0     2336     0 1441.   35987
## 18 cryosleep    fct      98    1.1      3    NA   NA       NA
train <- train %>% mutate_if(is.character,as.factor)
## `mutate_if()` ignored the following grouping variables:
## • Column `group`
test <- test %>% mutate_if(is.character,as.factor)
## `mutate_if()` ignored the following grouping variables:
## • Column `group`
train$PassengerId <- as.character(train$PassengerId)
test$PassengerId <- as.character(test$PassengerId)
summary(train)
##      group      PassengerId         HomePlanet   CryoSleep   
##  0001_01:   1   Length:8693        Europa:2162   FALSE:5558  
##  0002_01:   1   Class :character   Earth :4772   TRUE :3037  
##  0003_01:   1   Mode  :character   Mars  :1759   NA's :  98  
##  0003_02:   1                                                
##  0004_01:   1                                                
##  0005_01:   1                                                
##  (Other):8687                                                
##         Destination        Age           VIP           RoomService   
##  55 Cancri e  :1800   Min.   : 0.00   Mode :logical   Min.   :    0  
##  PSO J318.5-22: 796   1st Qu.:20.00   FALSE:8494      1st Qu.:    0  
##  TRAPPIST-1e  :6097   Median :27.00   TRUE :199       Median :    0  
##                       Mean   :28.83                   Mean   :  220  
##                       3rd Qu.:37.00                   3rd Qu.:   41  
##                       Max.   :79.00                   Max.   :14327  
##                                                                      
##    FoodCourt        ShoppingMall          Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  448.4   Mean   :  169.6   Mean   :  304.6   Mean   :  298.3  
##  3rd Qu.:   61.0   3rd Qu.:   22.0   3rd Qu.:   53.0   3rd Qu.:   40.0  
##  Max.   :29813.0   Max.   :23492.0   Max.   :22408.0   Max.   :24133.0  
##                                                                         
##  Transported       withgroup      deck      side        expense     
##  Mode :logical   Min.   :0   F      :2831   P:4206   Min.   :    0  
##  FALSE:4315      1st Qu.:0   G      :2658   S:4288   1st Qu.:    0  
##  TRUE :4378      Median :0   E      : 876    : 199   Median :  716  
##                  Mean   :0   B      : 842            Mean   : 1441  
##                  3rd Qu.:0   C      : 747            3rd Qu.: 1441  
##                  Max.   :0   D      : 478            Max.   :35987  
##                              (Other): 261                           
##  cryosleep   
##  FALSE:5558  
##  TRUE :3037  
##  NA's :  98  
##              
##              
##              
## 
train$group <- NULL 
test$group <- NULL 
write.csv(train, "train_c.csv", row.names=FALSE)
write.csv(test, "test_c.csv", row.names=FALSE)
write.csv(test, "test_c.csv", row.names=FALSE)
train <- read.csv("train_c.csv")
test <- read.csv("test_c.csv")
paged_table(train)
paged_table(test)
describe_all(train)
## # A tibble: 17 × 8
##    variable     type     na na_pct unique   min    mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>   <dbl> <dbl>
##  1 PassengerId  chr       0    0     8693    NA   NA       NA
##  2 HomePlanet   chr       0    0        3    NA   NA       NA
##  3 CryoSleep    lgl      98    1.1      3     0    0.35     1
##  4 Destination  chr       0    0        3    NA   NA       NA
##  5 Age          dbl       0    0       88     0   28.8     79
##  6 VIP          lgl       0    0        2     0    0.02     1
##  7 RoomService  int       0    0     1273     0  220.   14327
##  8 FoodCourt    int       0    0     1507     0  448.   29813
##  9 ShoppingMall int       0    0     1115     0  170.   23492
## 10 Spa          int       0    0     1327     0  305.   22408
## 11 VRDeck       int       0    0     1306     0  298.   24133
## 12 Transported  lgl       0    0        2     0    0.5      1
## 13 withgroup    int       0    0        1     0    0        0
## 14 deck         chr       0    0        8    NA   NA       NA
## 15 side         chr       0    0        3    NA   NA       NA
## 16 expense      int       0    0     2336     0 1441.   35987
## 17 cryosleep    lgl      98    1.1      3     0    0.35     1
describe_all(test)
## # A tibble: 17 × 8
##    variable     type     na na_pct unique   min    mean   max
##    <chr>        <chr> <int>  <dbl>  <int> <dbl>   <dbl> <dbl>
##  1 PassengerId  chr       0    0     8693    NA   NA       NA
##  2 HomePlanet   chr       0    0        3    NA   NA       NA
##  3 CryoSleep    lgl      98    1.1      3     0    0.35     1
##  4 Destination  chr       0    0        3    NA   NA       NA
##  5 Age          dbl       0    0       88     0   28.8     79
##  6 VIP          lgl       0    0        2     0    0.02     1
##  7 RoomService  int       0    0     1273     0  220.   14327
##  8 FoodCourt    int       0    0     1507     0  448.   29813
##  9 ShoppingMall int       0    0     1115     0  170.   23492
## 10 Spa          int       0    0     1327     0  305.   22408
## 11 VRDeck       int       0    0     1306     0  298.   24133
## 12 Transported  lgl       0    0        2     0    0.5      1
## 13 withgroup    int       0    0        1     0    0        0
## 14 deck         chr       0    0        8    NA   NA       NA
## 15 side         chr       0    0        3    NA   NA       NA
## 16 expense      int       0    0     2336     0 1441.   35987
## 17 cryosleep    lgl      98    1.1      3     0    0.35     1
test$HomePlanet <- as.factor(test$HomePlanet)
test$Destination <- as.factor(test$Destination)
test$deck <- as.factor(test$deck)
test$side <- as.factor(test$side)
train$HomePlanet <- as.factor(train$HomePlanet)
train$Destination <- as.factor(train$Destination)
train$deck <- as.factor(train$deck)
train$side <- as.factor(train$side)
D <- train[,2:16] %>% mutate(across(everything(), ~as.integer(.)))
kor <- cor(D)
## Warning in cor(D): the standard deviation is zero
library(corrplot)
## corrplot 0.92 loaded
corrplot.mixed(kor)

library(DataExplorer)
create_report(train)
## 
## 
## processing file: report.rmd
##   |                                             |                                     |   0%  |                                             |.                                    |   2%                                   |                                             |..                                   |   5% [global_options]                  |                                             |...                                  |   7%                                   |                                             |....                                 |  10% [introduce]                       |                                             |....                                 |  12%                                   |                                             |.....                                |  14% [plot_intro]                      |                                             |......                               |  17%                                   |                                             |.......                              |  19% [data_structure]                  |                                             |........                             |  21%                                   |                                             |.........                            |  24% [missing_profile]                 |                                             |..........                           |  26%                                   |                                             |...........                          |  29% [univariate_distribution_header]  |                                             |...........                          |  31%                                   |                                             |............                         |  33% [plot_histogram]                  |                                             |.............                        |  36%                                   |                                             |..............                       |  38% [plot_density]                    |                                             |...............                      |  40%                                   |                                             |................                     |  43% [plot_frequency_bar]              |                                             |.................                    |  45%                                   |                                             |..................                   |  48% [plot_response_bar]               |                                             |..................                   |  50%                                   |                                             |...................                  |  52% [plot_with_bar]                   |                                             |....................                 |  55%                                   |                                             |.....................                |  57% [plot_normal_qq]                  |                                             |......................               |  60%                                   |                                             |.......................              |  62% [plot_response_qq]                |                                             |........................             |  64%                                   |                                             |.........................            |  67% [plot_by_qq]                      |                                             |..........................           |  69%                                   |                                             |..........................           |  71% [correlation_analysis]            |                                             |...........................          |  74%                                   |                                             |............................         |  76% [principal_component_analysis]    |                                             |.............................        |  79%                                   |                                             |..............................       |  81% [bivariate_distribution_header]   |                                             |...............................      |  83%                                   |                                             |................................     |  86% [plot_response_boxplot]           |                                             |.................................    |  88%                                   |                                             |.................................    |  90% [plot_by_boxplot]                 |                                             |..................................   |  93%                                   |                                             |...................................  |  95% [plot_response_scatterplot]       |                                             |.................................... |  98%                                   |                                             |.....................................| 100% [plot_by_scatterplot]           
## output file: C:/Users/cabdi/OneDrive/Documents/titanic proje2/report.knit.md
## "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/pandoc" +RTS -K512m -RTS "C:\Users\cabdi\OneDrive\DOCUME~1\TITANI~1\REPORT~1.MD" --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output pandoc3d08774341fb.html --lua-filter "C:\Users\cabdi\AppData\Local\R\win-library\4.4\rmarkdown\rmarkdown\lua\pagebreak.lua" --lua-filter "C:\Users\cabdi\AppData\Local\R\win-library\4.4\rmarkdown\rmarkdown\lua\latex-div.lua" --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 6 --template "C:\Users\cabdi\AppData\Local\R\win-library\4.4\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable theme=yeti --mathjax --variable "mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --include-in-header "C:\Users\cabdi\AppData\Local\Temp\RtmpA5InqM\rmarkdown-str3d08693e2b9c.html"
## 
## Output created: report.html
model <- lm(Transported ~ ., data = train[, 2:16]) 
summary(model)
## 
## Call:
## lm(formula = Transported ~ ., data = train[, 2:16])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47726 -0.30338 -0.03261  0.28466  1.74104 
## 
## Coefficients: (2 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.595e-01  4.921e-02   7.306 3.00e-13 ***
## HomePlanetEuropa          1.853e-01  2.783e-02   6.656 2.98e-11 ***
## HomePlanetMars            8.492e-02  1.463e-02   5.804 6.71e-09 ***
## CryoSleepTRUE             3.970e-01  1.166e-02  34.062  < 2e-16 ***
## DestinationPSO J318.5-22 -4.581e-02  1.812e-02  -2.528   0.0115 *  
## DestinationTRAPPIST-1e   -4.454e-02  1.136e-02  -3.920 8.93e-05 ***
## Age                      -2.258e-03  3.156e-04  -7.154 9.13e-13 ***
## VIPTRUE                  -3.282e-02  2.967e-02  -1.106   0.2687    
## RoomService              -1.143e-04  7.054e-06 -16.198  < 2e-16 ***
## FoodCourt                 4.453e-05  3.057e-06  14.567  < 2e-16 ***
## ShoppingMall              8.238e-05  7.452e-06  11.056  < 2e-16 ***
## Spa                      -8.445e-05  4.111e-06 -20.543  < 2e-16 ***
## VRDeck                   -8.076e-05  4.113e-06 -19.637  < 2e-16 ***
## withgroup                        NA         NA      NA       NA    
## deckB                     1.156e-01  2.906e-02   3.977 7.05e-05 ***
## deckC                     1.445e-01  2.939e-02   4.917 8.94e-07 ***
## deckD                     3.850e-02  3.486e-02   1.105   0.2694    
## deckE                    -9.918e-03  3.595e-02  -0.276   0.7826    
## deckF                     8.113e-02  3.666e-02   2.213   0.0269 *  
## deckG                     2.612e-02  3.823e-02   0.683   0.4945    
## deckT                     5.196e-02  1.810e-01   0.287   0.7741    
## sideP                    -2.148e-02  2.960e-02  -0.726   0.4681    
## sideS                     6.536e-02  2.958e-02   2.210   0.0271 *  
## expense                          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4002 on 8573 degrees of freedom
##   (98 observations deleted due to missingness)
## Multiple R-squared:  0.3609, Adjusted R-squared:  0.3594 
## F-statistic: 230.6 on 21 and 8573 DF,  p-value: < 2.2e-16
library(caTools)
set.seed(123)
split = sample.split(train$Transported, SplitRatio = 0.75)
train_train = subset(train, split == TRUE)
train_test = subset(train, split == FALSE)
regresyon <- lm(Transported ~ ., data = train_train[, -c(1)])
summary(regresyon)
## 
## Call:
## lm(formula = Transported ~ ., data = train_train[, -c(1)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.55246 -0.30616 -0.03245  0.28473  1.72031 
## 
## Coefficients: (3 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.804e-01  5.691e-02   6.685 2.51e-11 ***
## HomePlanetEuropa          1.718e-01  3.220e-02   5.337 9.79e-08 ***
## HomePlanetMars            7.364e-02  1.686e-02   4.368 1.27e-05 ***
## CryoSleepTRUE             3.923e-01  1.343e-02  29.202  < 2e-16 ***
## DestinationPSO J318.5-22 -6.199e-02  2.087e-02  -2.970 0.002989 ** 
## DestinationTRAPPIST-1e   -4.951e-02  1.312e-02  -3.773 0.000163 ***
## Age                      -2.396e-03  3.666e-04  -6.535 6.84e-11 ***
## VIPTRUE                  -1.085e-02  3.376e-02  -0.321 0.747941    
## RoomService              -1.138e-04  7.789e-06 -14.615  < 2e-16 ***
## FoodCourt                 4.044e-05  3.468e-06  11.662  < 2e-16 ***
## ShoppingMall              8.551e-05  8.406e-06  10.173  < 2e-16 ***
## Spa                      -8.337e-05  4.707e-06 -17.712  < 2e-16 ***
## VRDeck                   -8.022e-05  4.772e-06 -16.813  < 2e-16 ***
## withgroup                        NA         NA      NA       NA    
## deckB                     1.069e-01  3.323e-02   3.216 0.001305 ** 
## deckC                     1.360e-01  3.355e-02   4.054 5.09e-05 ***
## deckD                     1.807e-02  3.969e-02   0.455 0.648946    
## deckE                    -4.120e-02  4.148e-02  -0.993 0.320636    
## deckF                     6.175e-02  4.211e-02   1.466 0.142587    
## deckG                     9.273e-03  4.394e-02   0.211 0.832861    
## deckT                     4.238e-02  1.816e-01   0.233 0.815497    
## sideP                    -1.001e-02  3.456e-02  -0.290 0.772074    
## sideS                     8.117e-02  3.452e-02   2.351 0.018753 *  
## expense                          NA         NA      NA       NA    
## cryosleepTRUE                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4002 on 6424 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.3616, Adjusted R-squared:  0.3596 
## F-statistic: 173.3 on 21 and 6424 DF,  p-value: < 2.2e-16
reg_tahmin = predict(regresyon, newdata = train_test[, -c(1,12)])
reg_transported_tahmin <- ifelse(reg_tahmin > 0.5, 1, 0)
transported_gercek <- ifelse(train_test[12] == TRUE, 1,0)
cm = table(transported_gercek, reg_transported_tahmin)
cm
##                   reg_transported_tahmin
## transported_gercek   0   1
##                  0 903 170
##                  1 315 761
(903 + 761)/(903 + 761 + 315 + 170)
## [1] 0.7743136
reg_tahmin = predict(regresyon, newdata = train_train[, -c(1,12)])
reg_transported_tahmin <- ifelse(reg_tahmin > 0.5, 1, 0)
transported_gercek <- ifelse(train_train[12] == TRUE, 1,0)
cm = table(transported_gercek, reg_transported_tahmin)
cm
##                   reg_transported_tahmin
## transported_gercek    0    1
##                  0 2668  547
##                  1  930 2301
(2668 + 2301)/(2668 + 2301 + 930 +547)
## [1] 0.7708657
reg_tahmin_bd = predict(model, newdata = test[, -c(1)])
reg_transported_test_tahmin <- ifelse(reg_tahmin_bd > 0.5, TRUE, FALSE)
Transported <- as.character(reg_transported_test_tahmin)
PassengerId <- test$PassengerId
Transported<-as.vector(Transported)
submission_regrasyon <- cbind(PassengerId, Transported)
submission_regrasyon <- as.data.frame(submission_regrasyon)
library(stringr)
submission_regrasyon$Transported <- str_to_title(submission_regrasyon$Transported)
write.csv(submission_regrasyon, "siniftahmini.csv", row.names = FALSE, quote = FALSE)
library(glmulti)
## Zorunlu paket yükleniyor: rJava
## Zorunlu paket yükleniyor: leaps
regresyon_opt <- glmulti(Transported ~ HomePlanet + Cryosleep + Destination + Age + VIP + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck + withgroup + deck + side, + level = 1, crit = bic, data =train)
modelglmulti <- lm(Transported ~ 1 + HomePlanet +  Destination + deck + side + Cryosleep + Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck, data = train )
reg_tahmin_glmulti = predict(modelglmulti, newdata = test[, -c(1)])
reg_transported_test_tahmin_glmulti <- ifelse(reg_tahmin_glmulti > 0.5, TRUE, FALSE)
Transported <- as.character(reg_transported_test_tahmin_glmulti)
PassengerId <- test$PassengerId
Transported<-as.vector(Transported)
submission_regrasyon_glmulti <- cbind(PassengerId, Transported)
submission_regrasyon_glmulti <- as.data.frame(submission_regrasyon_glmulti)
submission_regrasyon_glmulti$Transported <- str_to_title(submission_regrasyon_glmulti$Transpored)
write.csv(submission_regrasyon_glmulti, "submission_regrasyon_glmulti.csv", row.names =FALSE, quote=FALSE)
train_log <- train %>% 
  mutate_at(c(5, 7:11, 16), ~log(1 + .))
test_log <- test %>% 
  mutate_at(c(5, 7:11, 16), ~log(1 + .))
modellog <- lm(Transported ~ 1 + HomePlanet + Destination + deck + side + cryosleep +Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck, data = train_log)
reg_tahmin_log = predict(modellog, newdata = test_log[, -c(1)])
reg_transported_test_tahmin_log <- ifelse(reg_tahmin_log > 0.5, TRUE, FALSE)
Transported <- as.character(reg_transported_test_tahmin_log)
PassengerId <- test$PassengerId
Transported<-as.vector(Transported)
submission_regrasyon_log <- cbind(PassengerId, Transported)
submission_regrasyon_log <- as.data.frame(submission_regrasyon_log)

fotograf

submission_regrasyon_log$Transported <- str_to_title(submission_regrasyon_log$Transported)
write.csv(submission_regrasyon_log, "submission_regrasyon_log_csv", row.names =FALSE, quote=FALSE)
logistic = glm(formula = Transported ~ ., 
               family = binomial,
               data = train_train[, -c(1)])
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic)
## 
## Call:
## glm(formula = Transported ~ ., family = binomial, data = train_train[, 
##     -c(1)])
## 
## Coefficients: (3 not defined because of singularities)
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.176e-02  4.265e-01  -0.028 0.978004    
## HomePlanetEuropa          1.313e+00  2.522e-01   5.207 1.92e-07 ***
## HomePlanetMars            4.706e-01  1.098e-01   4.285 1.83e-05 ***
## CryoSleepTRUE             1.345e+00  9.356e-02  14.374  < 2e-16 ***
## DestinationPSO J318.5-22 -5.119e-01  1.307e-01  -3.918 8.94e-05 ***
## DestinationTRAPPIST-1e   -4.505e-01  9.464e-02  -4.760 1.94e-06 ***
## Age                      -9.359e-03  2.419e-03  -3.869 0.000109 ***
## VIPTRUE                  -1.637e-01  2.926e-01  -0.560 0.575788    
## RoomService              -1.692e-03  1.136e-04 -14.891  < 2e-16 ***
## FoodCourt                 4.599e-04  4.441e-05  10.357  < 2e-16 ***
## ShoppingMall              5.432e-04  7.598e-05   7.149 8.77e-13 ***
## Spa                      -1.924e-03  1.160e-04 -16.579  < 2e-16 ***
## VRDeck                   -1.852e-03  1.155e-04 -16.038  < 2e-16 ***
## withgroup                        NA         NA      NA       NA    
## deckB                     1.198e+00  2.935e-01   4.080 4.50e-05 ***
## deckC                     2.328e+00  3.316e-01   7.020 2.22e-12 ***
## deckD                     4.992e-01  3.216e-01   1.552 0.120605    
## deckE                    -1.290e-01  3.272e-01  -0.394 0.693288    
## deckF                     5.016e-01  3.306e-01   1.517 0.129186    
## deckG                     7.179e-02  3.402e-01   0.211 0.832849    
## deckT                    -3.072e-01  1.793e+00  -0.171 0.863922    
## sideP                    -1.586e-01  2.462e-01  -0.644 0.519550    
## sideS                     4.515e-01  2.465e-01   1.832 0.067006 .  
## expense                          NA         NA      NA       NA    
## cryosleepTRUE                    NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8936.0  on 6445  degrees of freedom
## Residual deviance: 5539.3  on 6424  degrees of freedom
##   (74 observations deleted due to missingness)
## AIC: 5583.3
## 
## Number of Fisher Scoring iterations: 7
logistic_tahmin = predict(logistic, newdata = train_test[, -c(1,12)])
head(logistic_tahmin)
##           2           3           7          14          16          21 
##  -1.0380056 -10.8408556   0.8859320  -1.6134173  -0.2589114  -1.7534862
logistic_transported_tahmin <- ifelse(logistic_tahmin > 0.5, 1, 0)
transported_gercek <- ifelse(train_test[12] == TRUE,1,0)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.1 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ data.table::between()   masks dplyr::between()
## ✖ scales::discard()       masks purrr::discard()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ data.table::first()     masks dplyr::first()
## ✖ recipes::fixed()        masks stringr::fixed()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ data.table::last()      masks dplyr::last()
## ✖ yardstick::spec()       masks readr::spec()
## ✖ recipes::step()         masks stats::step()
## ✖ data.table::transpose() masks purrr::transpose()
## • Learn how to get started at https://www.tidymodels.org/start/
result = data.frame(cbind(transported_gercek, logistic_transported_tahmin))
result$Transported <- as.factor(result$Transported)
result$logistic_transported_tahmin <- as.factor(result$logistic_transported_tahmin)
accuracy(result,truth = Transported, estimate = logistic_transported_tahmin)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.785
conf_mat(result, truth = Transported, estimate = logistic_transported_tahmin)
##           Truth
## Prediction   0   1
##          0 913 302
##          1 160 774
library(caret)
## Zorunlu paket yükleniyor: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
cm = table(transported_gercek, logistic_transported_tahmin)
confusionMatrix(as.factor(transported_gercek), as.factor(logistic_transported_tahmin))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 913 160
##          1 302 774
##                                          
##                Accuracy : 0.785          
##                  95% CI : (0.767, 0.8022)
##     No Information Rate : 0.5654         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5701         
##                                          
##  Mcnemar's Test P-Value : 5.384e-11      
##                                          
##             Sensitivity : 0.7514         
##             Specificity : 0.8287         
##          Pos Pred Value : 0.8509         
##          Neg Pred Value : 0.7193         
##              Prevalence : 0.5654         
##          Detection Rate : 0.4248         
##    Detection Prevalence : 0.4993         
##       Balanced Accuracy : 0.7901         
##                                          
##        'Positive' Class : 0              
## 
logistic_bd = glm(formula = Transported ~ ., 
                  family = binomial, 
                  data = train[, -c(1)])
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logistic_tahmin_bd = predict(logistic_bd, newdata = test[, -c(1)])
logistic_transported_test_tahmin <- ifelse(logistic_tahmin_bd > 0.5, TRUE, FALSE)
Transported <- as.character(logistic_transported_test_tahmin)
PassengerId <- test$PassengerId
Transported <- as.character(logistic_transported_test_tahmin)
PassengerId <- test$PassengerId
Transported<-as.vector(Transported)
submision_logistic <- cbind(PassengerId, Transported)
submision_logistic <- as.data.frame(submision_logistic)
submision_logistic$Transported <- 
  str_to_title(submision_logistic$Transported)
write.csv(submision_logistic, "submision_logistic.csv", row.names =FALSE, quote = FALSE)

NAIVE BAYES

library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
fit_nb <- naiveBayes(Transported ~ ., data =train_train[, -1])
fit_nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    FALSE     TRUE 
## 0.496319 0.503681 
## 
## Conditional probabilities:
##        HomePlanet
## Y           Earth    Europa      Mars
##   FALSE 0.6251545 0.1749073 0.1999382
##   TRUE  0.4649817 0.3285627 0.2064555
## 
##        CryoSleep
## Y           FALSE      TRUE
##   FALSE 0.8693624 0.1306376
##   TRUE  0.4243268 0.5756732
## 
##        Destination
## Y       55 Cancri e PSO J318.5-22 TRAPPIST-1e
##   FALSE  0.16069221    0.09363412  0.74567367
##   TRUE   0.24969549    0.09043849  0.65986602
## 
##        Age
## Y           [,1]     [,2]
##   FALSE 30.01152 13.45216
##   TRUE  27.84835 14.87502
## 
##        VIP
## Y            FALSE       TRUE
##   FALSE 0.97126082 0.02873918
##   TRUE  0.98142509 0.01857491
## 
##        RoomService
## Y            [,1]     [,2]
##   FALSE 402.20365 916.6547
##   TRUE   56.93484 246.8105
## 
##        FoodCourt
## Y           [,1]     [,2]
##   FALSE 396.2923 1258.123
##   TRUE  514.8018 1918.620
## 
##        ShoppingMall
## Y           [,1]     [,2]
##   FALSE 160.2250 432.1103
##   TRUE  182.7135 748.5867
## 
##        Spa
## Y            [,1]      [,2]
##   FALSE 561.16193 1554.2636
##   TRUE   61.11571  264.0556
## 
##        VRDeck
## Y            [,1]      [,2]
##   FALSE 533.98733 1542.8703
##   TRUE   69.07186  295.0542
## 
##        withgroup
## Y       [,1] [,2]
##   FALSE    0    0
##   TRUE     0    0
## 
##        deck
## Y                  A            B            C            D            E
##   FALSE 0.0296662546 0.0553152040 0.0568603214 0.0655129790 0.1338071693
##   TRUE  0.0301461632 0.1419001218 0.1178440926 0.0484165652 0.0672959805
##        deck
## Y                  F            G            T
##   FALSE 0.3634116193 0.2941903585 0.0012360939
##   TRUE  0.2834957369 0.3105968331 0.0003045067
## 
##        side
## Y                           P          S
##   FALSE 0.02348578 0.54079110 0.43572311
##   TRUE  0.02131547 0.43300853 0.54567600
## 
##        expense
## Y            [,1]     [,2]
##   FALSE 2053.8702 3224.219
##   TRUE   884.6376 2307.962
## 
##        cryosleep
## Y           FALSE      TRUE
##   FALSE 0.8693624 0.1306376
##   TRUE  0.4243268 0.5756732
pred_nb <- predict(fit_nb, newdata = train_test[, -c(1, 12)], type = "raw") %>% data.frame()
head(pred_nb)
##       FALSE.         TRUE.
## 1 0.33040033  6.695997e-01
## 2 1.00000000 3.164597e-135
## 3 0.06879377  9.312062e-01
## 4 0.71667750  2.833225e-01
## 5 0.03632331  9.636767e-01
## 6 0.62129515  3.787048e-01
Transported_pred_nb = ifelse(pred_nb$TRUE. > 0.5, 1, 0)
head(Transported_pred_nb)
## [1] 1 0 1 0 1 0
Transported_test_train <- ifelse(train_test[12] == TRUE, 1, 0)
head(Transported_test_train)
##    Transported
## 2            1
## 3            0
## 7            1
## 14           0
## 16           0
## 21           0
cm= table(Transported_test_train, Transported_pred_nb)
cm
##                       Transported_pred_nb
## Transported_test_train    0    1
##                      0  534  545
##                      1   86 1008
(534 + 1008) / (534 + 1008 + 86 + 545)
## [1] 0.709618
nb <- naiveBayes(Transported ~ ., data = train)
pred_nb <- predict(fit_nb, newdata = test,  type = "raw") %>% data.frame()
Transported_pred_nb = ifelse(pred_nb$TRUE. > 0.5, TRUE, FALSE)
Transported <- as.character(Transported_pred_nb)
PassengerId <- test$PassengerId
Transported <- as.vector(Transported)
sample_submision <- cbind(PassengerId, Transported)
sample_submision <- as.data.frame(sample_submision)
sample_submision$Transported <-str_to_title(sample_submision$Transported)
write.csv(sample_submision, "sub_svm_csv", row.names = FALSE, quote = FALSE)

SVM

fit_svm <- svm(Transported ~ .,data = train_train[, -1],
               type = 'C-classification',
               kernel = 'linear') 
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'withgroup' constant. Cannot scale data.
preds <- predict(fit_svm, newdata = train_test[, -c(1, 12)], type = "raw") %>% data.frame()
head(preds)
##        .
## 2   TRUE
## 3  FALSE
## 7   TRUE
## 14  TRUE
## 16  TRUE
## 21 FALSE
cm
##                       Transported_pred_nb
## Transported_test_train    0    1
##                      0  534  545
##                      1   86 1008
(534 + 1009) / (534 + 1008 + 86 + 545)
## [1] 0.7100782
fit_svm <- svm(Transported ~ ., data = train[, -1], 
               type = 'C-classification')
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'withgroup' constant. Cannot scale data.
preds <- predict(fit_svm, newdata = train_test[, -c(1, 12)], type = "raw") %>% data.frame()
Transported_pred_svm = ifelse(preds$. == TRUE, TRUE, FALSE)
Transported <- as.character(Transported_pred_svm)
PassengerId <- test$PassengerId
Transported <- as.vector(Transported)
sample_submision <- cbind(PassengerId, Transported)
## Warning in cbind(PassengerId, Transported): number of rows of result is not a
## multiple of vector length (arg 2)
sample_submision <- as.data.frame(sample_submision)
write.csv(sample_submision, "sub_svm_csv", row.names = FALSE, quote = FALSE)

SVM radial

fit_svm <- svm(Transported ~ ., data = train_train[, -1], 
      type= 'C-classification',
       kernel = 'radial' )
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'withgroup' constant. Cannot scale data.
preds <- predict(fit_svm, newdata= train_test[, -c(1,12)], type= "raw") %>% data.frame() 
Transported_pred_svm <- ifelse(preds$preds == TRUE, 1, 0)
cm
##                       Transported_pred_nb
## Transported_test_train    0    1
##                      0  534  545
##                      1   86 1008
(534 + 1008) / (534 + 545 + 86 + 1008)
## [1] 0.709618
fit_svm <- svm(Transported ~ ., data = train[, -1], 
        type= 'C-classification',
       kernel = 'radial' )
## Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
## Variable(s) 'withgroup' constant. Cannot scale data.
preds <- predict(fit_svm, newdata= test, type= "raw") %>% data.frame()
Transported_pred_svm = ifelse(preds$.== TRUE,TRUE,FALSE)
Transported <- as.character(Transported_pred_svm)
PassengerId <- test$PassengerId
Transported <- as.vector(Transported)
sample_submission <- cbind(PassengerId,Transported)
## Warning in cbind(PassengerId, Transported): number of rows of result is not a
## multiple of vector length (arg 2)
sample_submission <- as.data.frame(sample_submission)
sample_submission$Transported <- str_to_title(sample_submission$Transported)
write.csv(sample_submission,"sub_svm_radial.csv",row.names = FALSE, quote = FALSE)
P <- ggplot(train_train,aes(x=HomePlanet, y=deck,color=factor(Transported))) +
  geom_point(aes(shape=factor(Transported)), size=3) +
  scale_color_viridis_d() +
  labs(title = "", x="HomePlanet", y="deck") +
  theme_minimal() +
  theme(legend.position = "top")
P

library(rpart)
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
fit_tree <- rpart::rpart(Transported ~ ., data = train_train[, -1])
summary(fit_tree)
## Call:
## rpart::rpart(formula = Transported ~ ., data = train_train[, 
##     -1])
##   n= 6520 
## 
##           CP nsplit rel error    xerror         xstd
## 1 0.23321359      0 1.0000000 1.0001811 0.0001977604
## 2 0.04981598      1 0.7667864 0.7671015 0.0103212807
## 3 0.02807107      2 0.7169704 0.7173428 0.0090686704
## 4 0.02784643      3 0.6888994 0.7031918 0.0095398822
## 5 0.01770479      4 0.6610529 0.6654236 0.0096767137
## 6 0.01381587      5 0.6433481 0.6488743 0.0100411278
## 7 0.01194576      6 0.6295323 0.6381676 0.0101816720
## 8 0.01164524      7 0.6175865 0.6303299 0.0102037441
## 9 0.01000000      8 0.6059413 0.6235169 0.0102521030
## 
## Variable importance
##      expense    cryosleep    CryoSleep    FoodCourt          Spa       VRDeck 
##           21           16           16           12           11           10 
##         deck   HomePlanet ShoppingMall          Age  Destination 
##            5            5            2            1            1 
## 
## Node number 1: 6520 observations,    complexity param=0.2332136
##   mean=0.503681, MSE=0.2499865 
##   left son=2 (3795 obs) right son=3 (2725 obs)
##   Primary splits:
##       expense     < 0.5     to the right, improve=0.2332136, (0 missing)
##       CryoSleep   < 0.5     to the left,  improve=0.2141494, (74 missing)
##       cryosleep   < 0.5     to the left,  improve=0.2141494, (74 missing)
##       RoomService < 0.5     to the right, improve=0.1254121, (0 missing)
##       Spa         < 0.5     to the right, improve=0.1142180, (0 missing)
##   Surrogate splits:
##       CryoSleep < 0.5     to the left,  agree=0.932, adj=0.837, (0 split)
##       cryosleep < 0.5     to the left,  agree=0.932, adj=0.837, (0 split)
##       Spa       < 0.5     to the right, agree=0.784, adj=0.484, (0 split)
##       FoodCourt < 0.5     to the right, agree=0.770, adj=0.451, (0 split)
##       VRDeck    < 0.5     to the right, agree=0.768, adj=0.444, (0 split)
## 
## Node number 2: 3795 observations,    complexity param=0.02807107
##   mean=0.2990777, MSE=0.2096302 
##   left son=4 (3243 obs) right son=5 (552 obs)
##   Primary splits:
##       FoodCourt    < 1331    to the left,  improve=0.05751186, (0 missing)
##       ShoppingMall < 627.5   to the left,  improve=0.04530804, (0 missing)
##       RoomService  < 365.5   to the right, improve=0.04440387, (0 missing)
##       Spa          < 257.5   to the right, improve=0.03347453, (0 missing)
##       VRDeck       < 721     to the right, improve=0.02290457, (0 missing)
##   Surrogate splits:
##       expense    < 5981    to the left,  agree=0.885, adj=0.210, (0 split)
##       deck       splits as  RRRLLLLL,    agree=0.884, adj=0.201, (0 split)
##       HomePlanet splits as  LRL,         agree=0.878, adj=0.161, (0 split)
##       Spa        < 8955.5  to the left,  agree=0.856, adj=0.009, (0 split)
##       VRDeck     < 11692   to the left,  agree=0.856, adj=0.009, (0 split)
## 
## Node number 3: 2725 observations,    complexity param=0.04981598
##   mean=0.7886239, MSE=0.1666963 
##   left son=6 (1449 obs) right son=7 (1276 obs)
##   Primary splits:
##       deck        splits as  RRRRLRL-,    improve=0.17874760, (0 missing)
##       HomePlanet  splits as  LRR,         improve=0.12440710, (0 missing)
##       Destination splits as  RLL,         improve=0.02625136, (0 missing)
##       CryoSleep   < 0.5     to the left,  improve=0.02267577, (74 missing)
##       cryosleep   < 0.5     to the left,  improve=0.02267577, (74 missing)
##   Surrogate splits:
##       HomePlanet  splits as  LRR,         agree=0.933, adj=0.857, (0 split)
##       Age         < 24.5    to the left,  agree=0.625, adj=0.200, (0 split)
##       Destination splits as  RLL,         agree=0.591, adj=0.126, (0 split)
##       VIP         < 0.5     to the left,  agree=0.538, adj=0.014, (0 split)
##       side        splits as  RLL,         agree=0.533, adj=0.002, (0 split)
## 
## Node number 4: 3243 observations,    complexity param=0.02784643
##   mean=0.2537774, MSE=0.1893744 
##   left son=8 (2577 obs) right son=9 (666 obs)
##   Primary splits:
##       ShoppingMall < 541.5   to the left,  improve=0.07390355, (0 missing)
##       RoomService  < 365.5   to the right, improve=0.03464407, (0 missing)
##       Spa          < 240.5   to the right, improve=0.03327259, (0 missing)
##       VRDeck       < 114     to the right, improve=0.02784287, (0 missing)
##       expense      < 2867.5  to the right, improve=0.01811461, (0 missing)
##   Surrogate splits:
##       expense < 18644   to the left,  agree=0.795, adj=0.003, (0 split)
## 
## Node number 5: 552 observations,    complexity param=0.01770479
##   mean=0.5652174, MSE=0.2457467 
##   left son=10 (123 obs) right son=11 (429 obs)
##   Primary splits:
##       Spa     < 1372.5  to the right, improve=0.21272970, (0 missing)
##       VRDeck  < 1063.5  to the right, improve=0.17089500, (0 missing)
##       expense < 5395    to the right, improve=0.06611166, (0 missing)
##       deck    splits as  LLRLLRRL,    improve=0.02812225, (0 missing)
##       side    splits as  LLR,         improve=0.02807513, (0 missing)
##   Surrogate splits:
##       expense     < 12647   to the right, agree=0.790, adj=0.057, (0 split)
##       Age         < 13.5    to the left,  agree=0.779, adj=0.008, (0 split)
##       RoomService < 3895.5  to the right, agree=0.779, adj=0.008, (0 split)
## 
## Node number 6: 1449 observations
##   mean=0.6266391, MSE=0.2339625 
## 
## Node number 7: 1276 observations
##   mean=0.9725705, MSE=0.02667709 
## 
## Node number 8: 2577 observations,    complexity param=0.01194576
##   mean=0.193636, MSE=0.1561411 
##   left son=16 (2067 obs) right son=17 (510 obs)
##   Primary splits:
##       FoodCourt   < 456.5   to the left,  improve=0.04838893, (0 missing)
##       expense     < 1447.5  to the right, improve=0.04016842, (0 missing)
##       HomePlanet  splits as  RLL,         improve=0.02438006, (0 missing)
##       Spa         < 537.5   to the right, improve=0.01895890, (0 missing)
##       RoomService < 400.5   to the right, improve=0.01706521, (0 missing)
##   Surrogate splits:
##       expense < 12373   to the left,  agree=0.804, adj=0.008, (0 split)
##       Spa     < 13650   to the left,  agree=0.803, adj=0.006, (0 split)
##       VRDeck  < 10123.5 to the left,  agree=0.802, adj=0.002, (0 split)
##       deck    splits as  LLLLLLLR,    agree=0.802, adj=0.002, (0 split)
## 
## Node number 9: 666 observations
##   mean=0.4864865, MSE=0.2498174 
## 
## Node number 10: 123 observations
##   mean=0.1382114, MSE=0.119109 
## 
## Node number 11: 429 observations,    complexity param=0.01381587
##   mean=0.6876457, MSE=0.2147891 
##   left son=22 (143 obs) right son=23 (286 obs)
##   Primary splits:
##       VRDeck      < 611     to the right, improve=0.24438400, (0 missing)
##       Spa         < 225     to the right, improve=0.05300377, (0 missing)
##       FoodCourt   < 3119.5  to the left,  improve=0.05168044, (0 missing)
##       side        splits as  LLR,         improve=0.04323810, (0 missing)
##       RoomService < 1719.5  to the right, improve=0.03930897, (0 missing)
##   Surrogate splits:
##       expense   < 6032    to the right, agree=0.702, adj=0.105, (0 split)
##       Age       < 53.5    to the right, agree=0.674, adj=0.021, (0 split)
##       FoodCourt < 12128.5 to the right, agree=0.671, adj=0.014, (0 split)
## 
## Node number 16: 2067 observations
##   mean=0.1504596, MSE=0.1278215 
## 
## Node number 17: 510 observations,    complexity param=0.01164524
##   mean=0.3686275, MSE=0.2327413 
##   left son=34 (204 obs) right son=35 (306 obs)
##   Primary splits:
##       expense    < 1447.5  to the right, improve=0.15990760, (0 missing)
##       VRDeck     < 86.5    to the right, improve=0.10060710, (0 missing)
##       HomePlanet splits as  RLL,         improve=0.07491751, (0 missing)
##       Spa        < 500     to the right, improve=0.07353369, (0 missing)
##       deck       splits as  LLLLRRRL,    improve=0.05075444, (0 missing)
##   Surrogate splits:
##       HomePlanet splits as  RLL,         agree=0.867, adj=0.667, (0 split)
##       deck       splits as  LLLLRRRL,    agree=0.839, adj=0.598, (0 split)
##       VRDeck     < 213.5   to the right, agree=0.818, adj=0.544, (0 split)
##       Spa        < 219.5   to the right, agree=0.792, adj=0.480, (0 split)
##       FoodCourt  < 907     to the right, agree=0.722, adj=0.304, (0 split)
## 
## Node number 22: 143 observations
##   mean=0.3636364, MSE=0.231405 
## 
## Node number 23: 286 observations
##   mean=0.8496503, MSE=0.1277446 
## 
## Node number 34: 204 observations
##   mean=0.1323529, MSE=0.1148356 
## 
## Node number 35: 306 observations
##   mean=0.5261438, MSE=0.2493165
rpart.plot(fit_tree)

preds= predict(fit_tree,newdata = train_test[, -c(1,12)]) %>%
  data.frame()
head(preds)
##            .
## 2  0.1504596
## 3  0.1382114
## 7  0.8496503
## 14 0.1504596
## 16 0.4864865
## 21 0.1504596
Transported_pred_tree = ifelse(preds$. >0.5, 1, 0)

cm
##                       Transported_pred_nb
## Transported_test_train    0    1
##                      0  534  545
##                      1   86 1008
(534+1008)/(534 +1008+545 +86)
## [1] 0.709618
fit_tree <- rpart(Transported ~ ., data = train[, -1])

preds <- predict(fit_tree, newdata = test) %>%
  data.frame()
Transported_pred_tree = ifelse(preds$. > 0.5, TRUE, FALSE)
Transported <- as.character(Transported_pred_tree)
PassengerId <- test$PassengerId
Transported <- as.vector(Transported)
sample_submission <- cbind(PassengerId, Transported)
sample_submission <- as.data.frame(sample_submission) 
sample_submission$Transported <- str_to_title(sample_submission$Transported)
write.csv(sample_submission, "sub_forest.csv", row.names = FALSE, quote = FALSE)

Random Forest

data <- na.omit(data())
fit_forest <- randomForest(Transported ~ ., data = train_train[, -1], na.action = na.omit)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
fit_forest$importance
##              IncNodePurity
## HomePlanet       38.741878
## CryoSleep        80.608839
## Destination      24.787945
## Age              96.824678
## VIP               2.387148
## RoomService     102.275704
## FoodCourt       114.963217
## ShoppingMall     92.371759
## Spa             117.267689
## VRDeck          103.483651
## withgroup         0.000000
## deck             94.273488
## side             25.773873
## expense         272.213144
## cryosleep        67.135536
varImpPlot(fit_forest)

head(preds)
##           .
## 1 0.9722222
## 2 0.1590679
## 3 0.1260870
## 4 0.1260870
## 5 0.1590679
## 6 0.1590679
preds <- predict(fit_forest, newdata = test) %>%
    data.frame()
Transported <- as.character(Transported_pred_tree)
PassengerId <- test$PassengerId
Transported <- as.vector(Transported)
sample_submission <- cbind(PassengerId, Transported)
sample_submission <- as.data.frame(sample_submission) 
sample_submission$Transported <- str_to_title(sample_submission$Transported)
write.csv(sample_submission, "sub_forest.csv", row.names = FALSE, quote = FALSE)