First, let’s import the dataset using the read.csv command.

train <- read.csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv")

Next, let’s count the NAs in each variable.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Notice that no missing values are reported in the Cabin or Embarked columns, even though both contain blank cells. These columns hold text rather than numbers, so read.csv keeps a blank cell as an empty string ("") instead of converting it to NA.
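Here is a minimal sketch of that behavior on a made-up three-row CSV. The toy data and the names toy_csv and toy are assumptions for illustration only; they are not part of the Titanic file.

```r
# Toy CSV text (hypothetical data): one blank Age and one blank Cabin
toy_csv <- "Name,Age,Cabin
Anna,29,C85
Ben,,E46
Cara,41,"

toy <- read.csv(text = toy_csv)

# The blank in the numeric column (Age) is read as NA,
# but the blank in the text column (Cabin) is read as ""
colSums(is.na(toy))
sum(toy$Cabin == "")
```

The blank Age shows up in the NA count, while the blank Cabin only shows up when we look for empty strings.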

Fixing NA issue when importing the data

Use read_csv instead of read.csv

You can use the read_csv command (the readr option in the “import dataset” menu) after loading library(readr). By default it treats empty cells and the string "NA" as missing observations, but it will not recognize other codes for missing values (e.g. “.”) unless you list them in its na argument.

library(readr)
train1 <- read_csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s check for NAs.

colSums(is.na(train1))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Notice that we now see 687 missing values for Cabin and 2 for Embarked.
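If the file used another code for missing values, such as ".", read_csv would need to be told about it through its na argument. A small sketch on made-up data follows; the toy CSV and the names toy_default and toy_custom are assumptions, and wrapping the text in I() assumes readr 2.0 or later.

```r
library(readr)

# Toy CSV text (hypothetical data) where "." marks a missing Cabin
toy_csv <- "Name,Age,Cabin\nAnna,29,C85\nBen,,.\nCara,41,\n"

# I() tells read_csv to treat the string as literal data rather than a file path
toy_default <- read_csv(I(toy_csv), show_col_types = FALSE)
toy_custom <- read_csv(I(toy_csv), na = c("", "NA", "."), show_col_types = FALSE)

colSums(is.na(toy_default))  # "." is kept as a regular value in Cabin
colSums(is.na(toy_custom))   # "." is now counted as missing in Cabin
```

Note that the custom na vector repeats the defaults ("" and "NA"), because supplying na replaces the default list rather than adding to it.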

Fix on import with read.csv (entire dataframe)

Note: if your data uses several codes for missing values (e.g. empty cells, 999, or "."), you can list them all in the na.strings argument. For example, if empty cells, 999, and . all denote NA, then na.strings = c("", "999", ".") should work. CAUTION: this only works if each code means "missing" in every column of the dataset. For example, 999 could denote a missing value in Age but be an actual observation in PassengerId. In that case this approach would not work, at least not as written here.

train2 <- read.csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv", na.strings = c(""))

colSums(is.na(train2))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Notice that we now see 687 missing observations in Cabin and 2 in Embarked.
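For the situation in the CAUTION above, where a code such as 999 means "missing" in one column but is a real value in another, one option is to import the data without that code in na.strings and then recode only the affected column. This is a sketch on made-up data (the toy values are assumptions, not taken from the Titanic file):

```r
# Toy data (hypothetical): 999 means "missing" in Age but is a real PassengerId
toy_csv <- "PassengerId,Age
998,22
999,999
1000,"

toy <- read.csv(text = toy_csv)

# Recode 999 to NA in Age only; %in% returns FALSE (not NA) for values that
# are already NA, so the logical index is safe to use for assignment
toy$Age[toy$Age %in% 999] <- NA

colSums(is.na(toy))
toy$PassengerId  # the 999 in PassengerId is left untouched
```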

Fixing NAs for individual variables after importing

Remember, the dataframe “train” does not recognize empty cells as NAs. Let’s first verify that.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Now, to avoid “accidents”, let’s create a new variable “Cabin1” (a copy of Cabin) and replace its empty cells with NAs.

train$Cabin1<-train$Cabin  #new variable that is copy of Cabin
sum(is.na(train$Cabin1))  #verifying that NAs are still not recognized
## [1] 0

Now let’s replace empty cells with NAs that are recognized as missing observations

train$Cabin1[train$Cabin1==""]<-NA #replacing empty cells with NA (recognizing them as missing)
sum(is.na(train$Cabin1)) #verifying that it worked
## [1] 687
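The same replacement can be applied to several text columns in one step. The sketch below uses a small made-up data frame, and blank_to_na is a hypothetical helper written for this example, not a built-in R function.

```r
# Toy data frame (hypothetical values) with blanks in two character columns
toy <- data.frame(
  Cabin = c("C85", "", "E46", ""),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn empty strings into NA within one vector
blank_to_na <- function(x) {
  x[x %in% ""] <- NA
  x
}

# Apply the helper to the chosen columns only
cols <- c("Cabin", "Embarked")
toy[cols] <- lapply(toy[cols], blank_to_na)

colSums(is.na(toy))
```

Listing the columns explicitly in cols keeps the change limited to variables where a blank really does mean "missing".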