First, let’s import the dataset using the read.csv command.
train <- read.csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv")
Next, let’s count the NAs in each variable.
colSums(is.na(train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Notice that no missing values are reported for the Cabin or Embarked columns. This is because these columns hold text rather than numbers, so read.csv reads a blank cell as an empty string ("") instead of NA. In effect, R treats the blank as just another category.
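To see this behavior in isolation, here is a minimal sketch on a small made-up table (the `csv_text` and `toy` names and the four rows are illustrative assumptions, not part of the Titanic file). It reads the table with read.csv defaults and then counts the empty strings directly.

```r
# Tiny made-up CSV: row 3 has a blank Age, Cabin, and Embarked
csv_text <- "PassengerId,Age,Cabin,Embarked
1,22,,S
2,38,C85,C
3,,,
4,35,C123,S"

# stringsAsFactors = FALSE keeps text columns as character in older R versions too
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Blank numeric cells become NA, blank text cells do not
colSums(is.na(toy))      # Age: 1, Cabin: 0, Embarked: 0

# The blanks in text columns are stored as empty strings
sum(toy$Cabin == "")     # 2
sum(toy$Embarked == "")  # 1
```

Counting `== ""` is a quick way to check whether a text column is hiding missing values before deciding how to import the file.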
Alternatively, you can use the read_csv command from the readr package (the readr option in the “import dataset” menu), after loading it with library(readr). By default, read_csv recognizes empty cells as missing observations, but it will not recognize other codes for missing values (e.g. “.”) unless you tell it to.
library(readr)
train1 <- read_csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s check for NAs.
colSums(is.na(train1))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
Notice that we now see the 687 missing values for Cabin and the 2 for Embarked.
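If the file uses other codes for missing values, such as “.”, read_csv accepts them through its `na` argument. The sketch below uses a small made-up table (the `csv_text`, `default`, and `toy` names and the data are illustrative assumptions) and assumes readr 2.0 or later, where `I()` marks a string as literal data.

```r
library(readr)

# Tiny made-up CSV: "." and blank cells both stand for missing values
csv_text <- "PassengerId,Age,Cabin
1,22,C85
2,.,
3,38,."

# Default import: "." is not treated as missing, so Age is read as text
default <- read_csv(I(csv_text), show_col_types = FALSE)
class(default$Age)

# Listing "." in na makes readr treat it as missing everywhere
toy <- read_csv(I(csv_text), na = c("", "NA", "."), show_col_types = FALSE)
colSums(is.na(toy))   # Age: 1, Cabin: 2
```

Note that a stray code like “.” does more than hide missing values: it can turn a numeric column into a character column, which is often the first visible symptom.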
Note: read.csv can do the same through its na.strings argument. If you have multiple designators for NAs (e.g. empty cells, 999, etc.), you can simply list them all there. For example, if an empty cell, 999, and . are all codes for NA, then na.strings = c("", "999", ".") should work. CAUTION: this only works if the codes mean “missing” in the entire dataset. For example, 999 could denote a missing value in Age but be an actual observation in PassengerId. In that case this would not work, at least not as written here.
train2 <- read.csv("C:/Users/atomi/Desktop/Data Analysis Class/Spring 2025/titanic/train.csv", na.strings = c(""))
colSums(is.na(train2))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
Notice that we now see 687 missing observations in Cabin and 2 in Embarked.
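The caution about a code such as 999 can be illustrated with a small made-up table (the `csv_text`, `global`, and `toy` names and the data are illustrative assumptions). One possible workaround, sketched here, is to keep 999 out of na.strings and recode it in the affected column only.

```r
# Tiny made-up CSV: 999 means "missing" in Age but is a real PassengerId
csv_text <- "PassengerId,Age,Cabin
998,22,C85
999,999,
1000,.,E46"

# Global approach: 999 is turned into NA in every column, including PassengerId
global <- read.csv(text = csv_text, na.strings = c("", "999", "."),
                   stringsAsFactors = FALSE)
colSums(is.na(global))   # PassengerId: 1 (wrong), Age: 2, Cabin: 1

# Column-specific approach: list only the codes that are safe everywhere
toy <- read.csv(text = csv_text, na.strings = c("", "."),
                stringsAsFactors = FALSE)

# Then recode 999 in Age only; which() skips the NA already present in Age
toy$Age[which(toy$Age == 999)] <- NA
colSums(is.na(toy))      # PassengerId: 0, Age: 2, Cabin: 1
```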
You can also fix this after importing. Remember, the data frame “train” does not treat empty cells as NAs. Let’s first verify that.
colSums(is.na(train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Now, to avoid “accidents”, let’s create a new variable “Cabin1” (a copy of Cabin) and replace its empty cells with NAs, leaving the original column untouched.
train$Cabin1 <- train$Cabin  # new variable that is a copy of Cabin
sum(is.na(train$Cabin1))     # verifying that NAs are still not recognized
## [1] 0
Now let’s replace the empty cells with NA, so that R recognizes them as missing observations.
train$Cabin1[train$Cabin1 == ""] <- NA  # replacing empty cells with NA (recognizing them as missing)
sum(is.na(train$Cabin1))                # verifying that it worked
## [1] 687
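The same replacement can be applied to every text column at once rather than one column at a time. The following is a minimal sketch on a small made-up table (the `csv_text`, `toy`, and `char_cols` names and the data are illustrative assumptions). Unlike the Cabin1 example, it overwrites the columns in place, so work on a copy if you want to keep the originals.

```r
# Tiny made-up CSV with blanks in two text columns
csv_text <- "PassengerId,Cabin,Embarked
1,,S
2,C85,
3,,C"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Find the character columns
char_cols <- vapply(toy, is.character, logical(1))

# Replace empty strings with NA in each character column
toy[char_cols] <- lapply(toy[char_cols], function(x) {
  x[!is.na(x) & x == ""] <- NA
  x
})

colSums(is.na(toy))   # PassengerId: 0, Cabin: 2, Embarked: 1
```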