In this section, we will tidy up the data set by using mainly the package tidyverse.
The main functions that we will use as follow:
** mutate(): add new variables and preserves existing ones
** filter(): choose rows/cases where conditions are true
** select(): keep only the variables you mention
** group_by(): take an existing tbl and converts it into a grouped tbl where operations are performed “by group”
** summarise(): create one or more scalar variables summarizing the variables of an existing tbl. Tbls with groups created by group_by() will result in one row in the output for each group. Tbls with no groups will result in one row
** count(): count observations
** gsub(): replace all matches of a string
** str_extract(): extract matching patterns from a string
** case_when:vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.1
## Warning: package 'dplyr' was built under R version 3.6.1
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.6.1
library(dplyr)
data <- read.csv(file = "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv", header = T, sep = ",")
str(data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The data contain 891 observations and 12 variables. Interger and numeric type : PassengerId, Survived, Pclass, Age, SibSp, Parch. Fare Factor type: Name, Sex, Ticket, Cabin, Embarked.
Description of the data set
PassengerID: Id of passengers Survived: (0 = No; 1 = Yes) Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) Name: Name Sex: Sex of passengers Age: Age Sibsp: Number of Siblings/Spouses Aboard Parch: Number of Parents/Children Aboard Ticket: Ticket Number Fare: Passenger Fare (British pound) Cabin: Cabin Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
head(data)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
tail(data)
## PassengerId Survived Pclass Name
## 886 886 0 3 Rice, Mrs. William (Margaret Norton)
## 887 887 0 2 Montvila, Rev. Juozas
## 888 888 1 1 Graham, Miss. Margaret Edith
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
## 890 890 1 1 Behr, Mr. Karl Howell
## 891 891 0 3 Dooley, Mr. Patrick
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 886 female 39 0 5 382652 29.125 Q
## 887 male 27 0 0 211536 13.000 S
## 888 female 19 0 0 112053 30.000 B42 S
## 889 female NA 1 2 W./C. 6607 23.450 S
## 890 male 26 0 0 111369 30.000 C148 C
## 891 male 32 0 0 370376 7.750 Q
We can also plot the summary above using plot_intro function.
plot_intro(data)
There are 5 discret_columns, and 7 continuous columns, corresponding to the 5 columns of factor type and to the 7 columns of numeric and interger type, repectively. There are 177 missing values and 714 completed rows. There are 10692 observations = 891 rows * 12 varibales = 10692 There is 1.7% missing observations (177/10692 = 1.655%) complete rows :80.1% (714/891 = 80.13 %)
If we want to know which column containing missing values
plot_missing(data)
All the missing values are found in the column age. In fact, 19.87 % of data in this column is missing. There are 892 rows. There are 19.87% * 891 = 177 missing row, in agreement with the summary above Other way to check that number
sum(is.na(data$Age))
## [1] 177
If we want to see missing values for each column
sapply(data,function(x)sum(is.na(x)))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
data %>%
sapply(function(x) sum(is.na(x)))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
data$Pclass <- factor(data$Pclass)
data$Survived <- factor(data$Survived)
We can verify the resulting datatype using function str function
class(data$Pclass)
## [1] "factor"
class(data$Survived)
## [1] "factor"
To be precised, we can use is.factor to check of these two columns are factor type now
is.factor(data$Pclass)
## [1] TRUE
is.factor(data$Survived)
## [1] TRUE
One way to deal with missing values is to replace NA row by the mean value of the column.
In this data set, we only have missing values for the age column. One way to deal with this situation is to replace the missing value by the average age. As the title may be related to the age, we will calculate the average age for each group of people who have the same title. As we have many titles (Mr, Ms, etc…) = many genders, we would assign the average age in function of title.
First, we should know how many titles in this data set
What are theses titles ? what are the average age of each of these groups ? First, we will create a new column that contains only the title. Adding a column title into a data set
How does the name column look like ?
data %>%
select(Name) %>%
head(20)
## Name
## 1 Braund, Mr. Owen Harris
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3 Heikkinen, Miss. Laina
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5 Allen, Mr. William Henry
## 6 Moran, Mr. James
## 7 McCarthy, Mr. Timothy J
## 8 Palsson, Master. Gosta Leonard
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
## 10 Nasser, Mrs. Nicholas (Adele Achem)
## 11 Sandstrom, Miss. Marguerite Rut
## 12 Bonnell, Miss. Elizabeth
## 13 Saundercock, Mr. William Henry
## 14 Andersson, Mr. Anders Johan
## 15 Vestrom, Miss. Hulda Amanda Adolfina
## 16 Hewlett, Mrs. (Mary D Kingcome)
## 17 Rice, Master. Eugene
## 18 Williams, Mr. Charles Eugene
## 19 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
## 20 Masselmani, Mrs. Fatima
You can see that the title is included in the name. We should remove the title and create a new conlumn containing only the title
data <- data %>%
mutate(Title = regmatches(Name, regexpr("[A-z]+\\.", Name)))
data %>%
select(Name, Title) %>%
head(10)
## Name Title
## 1 Braund, Mr. Owen Harris Mr.
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) Mrs.
## 3 Heikkinen, Miss. Laina Miss.
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs.
## 5 Allen, Mr. William Henry Mr.
## 6 Moran, Mr. James Mr.
## 7 McCarthy, Mr. Timothy J Mr.
## 8 Palsson, Master. Gosta Leonard Master.
## 9 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Mrs.
## 10 Nasser, Mrs. Nicholas (Adele Achem) Mrs.
To remove title in the column name
data <- data %>%
mutate(Name = gsub("[A-z]+\\.","",Name))
To check
data %>%
select(Name, Title) %>%
head(10)
## Name Title
## 1 Braund, Owen Harris Mr.
## 2 Cumings, John Bradley (Florence Briggs Thayer) Mrs.
## 3 Heikkinen, Laina Miss.
## 4 Futrelle, Jacques Heath (Lily May Peel) Mrs.
## 5 Allen, William Henry Mr.
## 6 Moran, James Mr.
## 7 McCarthy, Timothy J Mr.
## 8 Palsson, Gosta Leonard Master.
## 9 Johnson, Oscar W (Elisabeth Vilhelmina Berg) Mrs.
## 10 Nasser, Nicholas (Adele Achem) Mrs.
Now, we should place the column title in after the column name
data <- data %>%
select(PassengerId, Name, Title, Sex, Age, everything())
head(data, 10)
## PassengerId Name Title
## 1 1 Braund, Owen Harris Mr.
## 2 2 Cumings, John Bradley (Florence Briggs Thayer) Mrs.
## 3 3 Heikkinen, Laina Miss.
## 4 4 Futrelle, Jacques Heath (Lily May Peel) Mrs.
## 5 5 Allen, William Henry Mr.
## 6 6 Moran, James Mr.
## 7 7 McCarthy, Timothy J Mr.
## 8 8 Palsson, Gosta Leonard Master.
## 9 9 Johnson, Oscar W (Elisabeth Vilhelmina Berg) Mrs.
## 10 10 Nasser, Nicholas (Adele Achem) Mrs.
## Sex Age Survived Pclass SibSp Parch Ticket Fare Cabin
## 1 male 22 0 3 1 0 A/5 21171 7.2500
## 2 female 38 1 1 1 0 PC 17599 71.2833 C85
## 3 female 26 1 3 0 0 STON/O2. 3101282 7.9250
## 4 female 35 1 1 1 0 113803 53.1000 C123
## 5 male 35 0 3 0 0 373450 8.0500
## 6 male NA 0 3 0 0 330877 8.4583
## 7 male 54 0 1 0 0 17463 51.8625 E46
## 8 male 2 0 3 3 1 349909 21.0750
## 9 female 27 1 3 0 2 347742 11.1333
## 10 female 14 1 2 1 0 237736 30.0708
## Embarked
## 1 S
## 2 C
## 3 S
## 4 S
## 5 S
## 6 Q
## 7 S
## 8 S
## 9 S
## 10 C
Counting the title and order the output
data %>%
group_by (Title) %>%
count(sort = TRUE)
## # A tibble: 17 x 2
## # Groups: Title [17]
## Title n
## <chr> <int>
## 1 Mr. 517
## 2 Miss. 182
## 3 Mrs. 125
## 4 Master. 40
## 5 Dr. 7
## 6 Rev. 6
## 7 Col. 2
## 8 Major. 2
## 9 Mlle. 2
## 10 Capt. 1
## 11 Countess. 1
## 12 Don. 1
## 13 Jonkheer. 1
## 14 Lady. 1
## 15 Mme. 1
## 16 Ms. 1
## 17 Sir. 1
There are 17 groups. Mr., Miss., Mrs. are the most common among titles. There are some umcommon titles such as:Capt, Countess, Don., Jonkheer., Lady., Mme., Ms., and Sir.
Calculating the average for 17 groups
data %>%
group_by (Title) %>%
summarise (average = mean(Age, na.rm = TRUE))
## # A tibble: 17 x 2
## Title average
## <chr> <dbl>
## 1 Capt. 70
## 2 Col. 58
## 3 Countess. 33
## 4 Don. 40
## 5 Dr. 42
## 6 Jonkheer. 38
## 7 Lady. 48
## 8 Major. 48.5
## 9 Master. 4.57
## 10 Miss. 21.8
## 11 Mlle. 24
## 12 Mme. 24
## 13 Mr. 32.4
## 14 Mrs. 35.9
## 15 Ms. 28
## 16 Rev. 43.2
## 17 Sir. 49
Miss., 182 rows, the average age = 21.77 Mr., 517 rows, the average age = 32.36 Mrs.,125 rows, the average age= 35.89 Master., 40 rows,the average age = 4.57
Assign the missing values of the age column by respectable values
# train$Age[train$Title == "Mr." & is.na(train$Age)] <- mean(train$Age[train$Title == "Mr."], na.rm = TRUE)
# train$Age[train$Title == "Miss." & is.na(train$Age)] <- mean(train$Age[train$Title == "Miss."], na.rm = TRUE)
# train$Age[train$Title == "Mrs." & is.na(train$Age)] <- mean(train$Age[train$Title == "Mrs."], na.rm = TRUE)
# train$Age[train$Title == "Master." & is.na(train$Age)] <- mean(train$Age[train$Title == "Master."], na.rm = TRUE)
data <- data %>%
mutate(Age = case_when(Title == "Mr." & is.na(Age) ~ mean(Age[Title == "Mr."], na.rm = TRUE),
Title == "Miss." & is.na(Age)~ mean(Age[Title == "Miss."], na.rm = TRUE),
Title == "Mrs." & is.na(Age) ~ mean(Age[Title == "Mrs."], na.rm = TRUE),
Title == "Master." & is.na(Age) ~ mean(Age[Title == "Master."], na.rm = TRUE),
TRUE ~ Age))
sum(is.na(data$Age))
## [1] 1
The data still contain one missing value. To find out
data[which(is.na(data$Age)), c("Name", "Title", "Age")]
## Name Title Age
## 767 Brewe, Arthur Jackson Dr. NA
Let replace the missing value by the mean age of all rows where Title is Dr.
data$Age[data$Title == "Dr." & is.na(data$Age)] <- mean(data$Age[data$Title == "Dr."], na.rm = TRUE)
To check
sum(is.na(data$Age))
## [1] 0
6.Initial analysis
How many man and women in the data set
data %>%
group_by(Sex) %>%
count()
## # A tibble: 2 x 2
## # Groups: Sex [2]
## Sex n
## <fct> <int>
## 1 female 314
## 2 male 577
The mean age of each group
data%>%
group_by (Sex)%>%
summarise(avg = mean(Age, na.rm = TRUE))
## # A tibble: 2 x 2
## Sex avg
## <fct> <dbl>
## 1 female 27.6
## 2 male 30.9
How many people survived
data %>%
group_by(Survived, Sex) %>%
count()
## # A tibble: 4 x 3
## # Groups: Survived, Sex [4]
## Survived Sex n
## <fct> <fct> <int>
## 1 0 female 81
## 2 0 male 468
## 3 1 female 233
## 4 1 male 109
data %>%
filter(Title == c("Mrs.","Mr.", "Miss.")) %>%
group_by(Title, Survived) %>%
count(sort = TRUE)
## # A tibble: 6 x 3
## # Groups: Title, Survived [6]
## Title Survived n
## <chr> <fct> <int>
## 1 Mr. 0 147
## 2 Miss. 1 40
## 3 Mr. 1 31
## 4 Mrs. 1 31
## 5 Miss. 0 16
## 6 Mrs. 0 9
data %>%
group_by (Embarked)%>%
count(sort= TRUE)
## # A tibble: 4 x 2
## # Groups: Embarked [4]
## Embarked n
## <fct> <int>
## 1 S 644
## 2 C 168
## 3 Q 77
## 4 "" 2
data %>%
group_by (Cabin)%>%
count(sort= TRUE) %>%
head(10)
## # A tibble: 10 x 2
## # Groups: Cabin [10]
## Cabin n
## <fct> <int>
## 1 "" 687
## 2 B96 B98 4
## 3 C23 C25 C27 4
## 4 G6 4
## 5 C22 C26 3
## 6 D 3
## 7 E101 3
## 8 F2 3
## 9 F33 3
## 10 B18 2
data%>%
group_by (Pclass)%>%
count(sort= TRUE)
## # A tibble: 3 x 2
## # Groups: Pclass [3]
## Pclass n
## <fct> <int>
## 1 3 491
## 2 1 216
## 3 2 184
We can see here there is 687 rows that there is no cabin label We can remove this column
data <- data %>%
select(-Cabin)
The final data(consistent data)
summary(data)
## PassengerId Name Title Sex
## Min. : 1.0 Length:891 Length:891 female:314
## 1st Qu.:223.5 Class :character Class :character male :577
## Median :446.0 Mode :character Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age Survived Pclass SibSp Parch
## Min. : 0.42 0:549 1:216 Min. :0.000 Min. :0.0000
## 1st Qu.:21.77 1:342 2:184 1st Qu.:0.000 1st Qu.:0.0000
## Median :30.00 3:491 Median :0.000 Median :0.0000
## Mean :29.75 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.90 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
##
## Ticket Fare Embarked
## 1601 : 7 Min. : 0.00 : 2
## 347082 : 7 1st Qu.: 7.91 C:168
## CA. 2343: 7 Median : 14.45 Q: 77
## 3101295 : 6 Mean : 32.20 S:644
## 347088 : 6 3rd Qu.: 31.00
## CA 2144 : 6 Max. :512.33
## (Other) :852