titanic_dataset

Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.

remove(list = ls())

train <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/Summer 2024/Day 7/train.csv")

1. Run some preliminary correlations of Survived with some other variables.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# cor(train)
train_numeric <- select(.data = train, -Name, -Sex, -Ticket,-Cabin,-Embarked)

train_corr_matrix <- cor(train_numeric, use = "pairwise.complete.obs")
?cor

#install.packages(ggcorrplot)
library(ggcorrplot)
?ggcorrplot
ggcorrplot(corr = train_corr_matrix, type = "lower")

2. Conduct descriptive statistics of the dataset. Anything interesting you find ?

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

train$Sex_Male <- as.numeric(train$Sex=="male")

stargazer(train, type = "text")

## 
## ==============================================
## Statistic    N   Mean   St. Dev.  Min    Max  
## ----------------------------------------------
## PassengerId 891 446.000 257.354    1     891  
## Survived    891  0.384   0.487     0      1   
## Pclass      891  2.309   0.836     1      3   
## Age         714 29.699   14.526  0.420 80.000 
## SibSp       891  0.523   1.103     0      8   
## Parch       891  0.382   0.806     0      6   
## Fare        891 32.204   49.693  0.000 512.329
## Sex_Male    891  0.648   0.478     0      1   
## ----------------------------------------------

3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.

set.seed(100)
?sample_n
train_subset <- sample_n(tbl = train, 
                         size = 500)

test_subset <- dplyr::anti_join(x = train,
                                y = train_subset, 
                                by = "PassengerId")

4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your `n=500` sample.

model0 <-
lm(data = train_subset, 
    formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2) + Fare)

model1 <-
lm(data = train_subset, 
    formula = Survived ~ as.factor(Pclass) + Sex + Fare)

summary(model1)

## 
## Call:
## lm(formula = Survived ~ as.factor(Pclass) + Sex + Fare, data = train_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8956 -0.2806 -0.1234  0.2446  0.8787 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.8534440  0.0616989  13.832  < 2e-16 ***
## as.factor(Pclass)2 -0.1038129  0.0631827  -1.643    0.101    
## as.factor(Pclass)3 -0.2595314  0.0546242  -4.751 2.65e-06 ***
## Sexmale            -0.4726555  0.0398321 -11.866  < 2e-16 ***
## Fare                0.0002781  0.0004890   0.569    0.570    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4073 on 495 degrees of freedom
## Multiple R-squared:  0.3013, Adjusted R-squared:  0.2957 
## F-statistic: 53.38 on 4 and 495 DF,  p-value: < 2.2e-16

5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survival variable given independant variable values /data.

What is on the rows vs what is on the columns? Example
What is event of interest (positive)? EG - positive is survived or died?

?predict

test_subset$prediction <- predict(object = model1, newdata = test_subset)

summary(test_subset$prediction)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1213  0.1238  0.3808  0.4096  0.6003  0.9959

test_subset$predicted_Survived <- ifelse(test = test_subset$prediction > .5, yes = 1, no = 0)

confusion_matrix_hand <- 
as.matrix(table(test_subset$Survived, 
      test_subset$predicted_Survived
      ))
confusion_matrix_hand

##    
##       0   1
##   0 202  36
##   1  36 117

TP <- confusion_matrix_hand[2,2]
TN <- confusion_matrix_hand[1,1]
FP <- confusion_matrix_hand[2,1]
FN <- confusion_matrix_hand[1,2] 

accuracy   <- (TP+TN)/(TP+TN+FP+FN)
sentivity  <- (TP) / (TP + FN)
specificty <- (TN) / (TN + FP)

(202+117) / (202+36+36+117) # accuracy

## [1] 0.8158568

117 / (117 + 36 ) # true positive rate / senstivity - TP/P = TP / (TP+FN)

## [1] 0.7647059

202 / (202 + 36)  # true negative rate / specificity - TN/N = TN / (TN+FP)

## [1] 0.8487395

# install.packages("caret")
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

?confusionMatrix

## Help on topic 'confusionMatrix' was found in the following packages:
## 
##   Package               Library
##   ModelMetrics          /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   caret                 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## 
## Using the first match ...

caret::confusionMatrix(data      = as.factor(test_subset$predicted_Survived),
                       reference = as.factor(test_subset$Survived), 
                       positive  = "1"      # survived
                       )

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 202  36
##          1  36 117
##                                          
##                Accuracy : 0.8159         
##                  95% CI : (0.7738, 0.853)
##     No Information Rate : 0.6087         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6134         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.7647         
##             Specificity : 0.8487         
##          Pos Pred Value : 0.7647         
##          Neg Pred Value : 0.8487         
##              Prevalence : 0.3913         
##          Detection Rate : 0.2992         
##    Detection Prevalence : 0.3913         
##       Balanced Accuracy : 0.8067         
##                                          
##        'Positive' Class : 1              
##

titanic_dataset

AS

2024-07-30

Import the titanic train.csv file into R. You will have to clean the data to run some of commands/set up the workflow.

1. Run some preliminary correlations of Survived with some other variables.

2. Conduct descriptive statistics of the dataset. Anything interesting you find ?

3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.

4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.

5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survival variable given independant variable values /data.

4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your `n=500` sample.