Using tidyverse to clean and preliminarily to explore data

In this section, we will tidy up the data set by using mainly the package tidyverse.

The main functions that we will use as follow:

** mutate(): add new variables and preserves existing ones

** filter(): choose rows/cases where conditions are true

** select(): keep only the variables you mention

** group_by(): take an existing tbl and converts it into a grouped tbl where operations are performed “by group”

** summarise(): create one or more scalar variables summarizing the variables of an existing tbl. Tbls with groups created by group_by() will result in one row in the output for each group. Tbls with no groups will result in one row

** count(): count observations

** gsub(): replace all matches of a string

** str_extract(): extract matching patterns from a string

** case_when:vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.1

## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.1

## Warning: package 'dplyr' was built under R version 3.6.1

## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(DataExplorer)

## Warning: package 'DataExplorer' was built under R version 3.6.1

library(dplyr)

data <- read.csv(file = "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv", header = T, sep = ",")

str(data)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

The data contain 891 observations and 12 variables. Interger and numeric type : PassengerId, Survived, Pclass, Age, SibSp, Parch. Fare Factor type: Name, Sex, Ticket, Cabin, Embarked.

Description of the data set

PassengerID: Id of passengers Survived: (0 = No; 1 = Yes) Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) Name: Name Sex: Sex of passengers Age: Age Sibsp: Number of Siblings/Spouses Aboard Parch: Number of Parents/Children Aboard Ticket: Ticket Number Fare: Passenger Fare (British pound) Cabin: Cabin Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

head(data)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

tail(data)

##     PassengerId Survived Pclass                                     Name
## 886         886        0      3     Rice, Mrs. William (Margaret Norton)
## 887         887        0      2                    Montvila, Rev. Juozas
## 888         888        1      1             Graham, Miss. Margaret Edith
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie"
## 890         890        1      1                    Behr, Mr. Karl Howell
## 891         891        0      3                      Dooley, Mr. Patrick
##        Sex Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886 female  39     0     5     382652 29.125              Q
## 887   male  27     0     0     211536 13.000              S
## 888 female  19     0     0     112053 30.000   B42        S
## 889 female  NA     1     2 W./C. 6607 23.450              S
## 890   male  26     0     0     111369 30.000  C148        C
## 891   male  32     0     0     370376  7.750              Q

We can also plot the summary above using plot_intro function.

plot_intro(data)

There are 5 discret_columns, and 7 continuous columns, corresponding to the 5 columns of factor type and to the 7 columns of numeric and interger type, repectively. There are 177 missing values and 714 completed rows. There are 10692 observations = 891 rows * 12 varibales = 10692 There is 1.7% missing observations (177/10692 = 1.655%) complete rows :80.1% (714/891 = 80.13 %)

If we want to know which column containing missing values

plot_missing(data)

All the missing values are found in the column age. In fact, 19.87 % of data in this column is missing. There are 892 rows. There are 19.87% * 891 = 177 missing row, in agreement with the summary above Other way to check that number

sum(is.na(data$Age))

## [1] 177

If we want to see missing values for each column

sapply(data,function(x)sum(is.na(x)))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

data %>%
  sapply(function(x) sum(is.na(x)))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Working with data types Change the “Pclass” and “Survived” attributes to factors.

data$Pclass <- factor(data$Pclass)
data$Survived <- factor(data$Survived)

We can verify the resulting datatype using function str function

class(data$Pclass)

## [1] "factor"

class(data$Survived)

## [1] "factor"

To be precised, we can use is.factor to check of these two columns are factor type now

is.factor(data$Pclass)

## [1] TRUE

is.factor(data$Survived)

## [1] TRUE

Working with missing values and REGEX

One way to deal with missing values is to replace NA row by the mean value of the column.

In this data set, we only have missing values for the age column. One way to deal with this situation is to replace the missing value by the average age. As the title may be related to the age, we will calculate the average age for each group of people who have the same title. As we have many titles (Mr, Ms, etc…) = many genders, we would assign the average age in function of title.

First, we should know how many titles in this data set

What are theses titles ? what are the average age of each of these groups ? First, we will create a new column that contains only the title. Adding a column title into a data set

How does the name column look like ?

data %>% 
  select(Name) %>% 
  head(20)

##                                                       Name
## 1                                  Braund, Mr. Owen Harris
## 2      Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3                                   Heikkinen, Miss. Laina
## 4             Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5                                 Allen, Mr. William Henry
## 6                                         Moran, Mr. James
## 7                                  McCarthy, Mr. Timothy J
## 8                           Palsson, Master. Gosta Leonard
## 9        Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
## 10                     Nasser, Mrs. Nicholas (Adele Achem)
## 11                         Sandstrom, Miss. Marguerite Rut
## 12                                Bonnell, Miss. Elizabeth
## 13                          Saundercock, Mr. William Henry
## 14                             Andersson, Mr. Anders Johan
## 15                    Vestrom, Miss. Hulda Amanda Adolfina
## 16                        Hewlett, Mrs. (Mary D Kingcome) 
## 17                                    Rice, Master. Eugene
## 18                            Williams, Mr. Charles Eugene
## 19 Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)
## 20                                 Masselmani, Mrs. Fatima

You can see that the title is included in the name. We should remove the title and create a new conlumn containing only the title

data <- data %>% 
  mutate(Title = regmatches(Name, regexpr("[A-z]+\\.", Name)))

data %>% 
  select(Name, Title) %>% 
  head(10)

##                                                   Name   Title
## 1                              Braund, Mr. Owen Harris     Mr.
## 2  Cumings, Mrs. John Bradley (Florence Briggs Thayer)    Mrs.
## 3                               Heikkinen, Miss. Laina   Miss.
## 4         Futrelle, Mrs. Jacques Heath (Lily May Peel)    Mrs.
## 5                             Allen, Mr. William Henry     Mr.
## 6                                     Moran, Mr. James     Mr.
## 7                              McCarthy, Mr. Timothy J     Mr.
## 8                       Palsson, Master. Gosta Leonard Master.
## 9    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)    Mrs.
## 10                 Nasser, Mrs. Nicholas (Adele Achem)    Mrs.

To remove title in the column name

data <- data %>%
  mutate(Name = gsub("[A-z]+\\.","",Name))

To check

data %>% 
  select(Name, Title) %>% 
  head(10)

##                                               Name   Title
## 1                             Braund,  Owen Harris     Mr.
## 2  Cumings,  John Bradley (Florence Briggs Thayer)    Mrs.
## 3                                Heikkinen,  Laina   Miss.
## 4         Futrelle,  Jacques Heath (Lily May Peel)    Mrs.
## 5                            Allen,  William Henry     Mr.
## 6                                    Moran,  James     Mr.
## 7                             McCarthy,  Timothy J     Mr.
## 8                          Palsson,  Gosta Leonard Master.
## 9    Johnson,  Oscar W (Elisabeth Vilhelmina Berg)    Mrs.
## 10                 Nasser,  Nicholas (Adele Achem)    Mrs.

Now, we should place the column title in after the column name

data <- data %>% 
  select(PassengerId, Name, Title, Sex, Age, everything())

head(data, 10)

##    PassengerId                                            Name   Title
## 1            1                            Braund,  Owen Harris     Mr.
## 2            2 Cumings,  John Bradley (Florence Briggs Thayer)    Mrs.
## 3            3                               Heikkinen,  Laina   Miss.
## 4            4        Futrelle,  Jacques Heath (Lily May Peel)    Mrs.
## 5            5                           Allen,  William Henry     Mr.
## 6            6                                   Moran,  James     Mr.
## 7            7                            McCarthy,  Timothy J     Mr.
## 8            8                         Palsson,  Gosta Leonard Master.
## 9            9   Johnson,  Oscar W (Elisabeth Vilhelmina Berg)    Mrs.
## 10          10                 Nasser,  Nicholas (Adele Achem)    Mrs.
##       Sex Age Survived Pclass SibSp Parch           Ticket    Fare Cabin
## 1    male  22        0      3     1     0        A/5 21171  7.2500      
## 2  female  38        1      1     1     0         PC 17599 71.2833   C85
## 3  female  26        1      3     0     0 STON/O2. 3101282  7.9250      
## 4  female  35        1      1     1     0           113803 53.1000  C123
## 5    male  35        0      3     0     0           373450  8.0500      
## 6    male  NA        0      3     0     0           330877  8.4583      
## 7    male  54        0      1     0     0            17463 51.8625   E46
## 8    male   2        0      3     3     1           349909 21.0750      
## 9  female  27        1      3     0     2           347742 11.1333      
## 10 female  14        1      2     1     0           237736 30.0708      
##    Embarked
## 1         S
## 2         C
## 3         S
## 4         S
## 5         S
## 6         Q
## 7         S
## 8         S
## 9         S
## 10        C

Counting the title and order the output

data %>% 
group_by (Title) %>%
count(sort = TRUE)

## # A tibble: 17 x 2
## # Groups:   Title [17]
##    Title         n
##    <chr>     <int>
##  1 Mr.         517
##  2 Miss.       182
##  3 Mrs.        125
##  4 Master.      40
##  5 Dr.           7
##  6 Rev.          6
##  7 Col.          2
##  8 Major.        2
##  9 Mlle.         2
## 10 Capt.         1
## 11 Countess.     1
## 12 Don.          1
## 13 Jonkheer.     1
## 14 Lady.         1
## 15 Mme.          1
## 16 Ms.           1
## 17 Sir.          1

There are 17 groups. Mr., Miss., Mrs. are the most common among titles. There are some umcommon titles such as:Capt, Countess, Don., Jonkheer., Lady., Mme., Ms., and Sir.

Calculating the average for 17 groups

data %>%
  group_by (Title) %>%
  summarise (average = mean(Age, na.rm = TRUE))

## # A tibble: 17 x 2
##    Title     average
##    <chr>       <dbl>
##  1 Capt.       70   
##  2 Col.        58   
##  3 Countess.   33   
##  4 Don.        40   
##  5 Dr.         42   
##  6 Jonkheer.   38   
##  7 Lady.       48   
##  8 Major.      48.5 
##  9 Master.      4.57
## 10 Miss.       21.8 
## 11 Mlle.       24   
## 12 Mme.        24   
## 13 Mr.         32.4 
## 14 Mrs.        35.9 
## 15 Ms.         28   
## 16 Rev.        43.2 
## 17 Sir.        49

Miss., 182 rows, the average age = 21.77 Mr., 517 rows, the average age = 32.36 Mrs.,125 rows, the average age= 35.89 Master., 40 rows,the average age = 4.57

Assign the missing values of the age column by respectable values

# train$Age[train$Title == "Mr." & is.na(train$Age)] <- mean(train$Age[train$Title == "Mr."], na.rm = TRUE)
# train$Age[train$Title == "Miss." & is.na(train$Age)] <- mean(train$Age[train$Title == "Miss."], na.rm = TRUE)
# train$Age[train$Title == "Mrs." & is.na(train$Age)] <- mean(train$Age[train$Title == "Mrs."], na.rm = TRUE)
# train$Age[train$Title == "Master." & is.na(train$Age)] <- mean(train$Age[train$Title == "Master."], na.rm = TRUE)

data <- data %>% 
  mutate(Age = case_when(Title == "Mr." & is.na(Age) ~ mean(Age[Title == "Mr."], na.rm = TRUE),
                         Title == "Miss." & is.na(Age)~ mean(Age[Title == "Miss."], na.rm = TRUE),
                         Title == "Mrs." & is.na(Age) ~ mean(Age[Title == "Mrs."], na.rm = TRUE),
                         Title == "Master." & is.na(Age) ~ mean(Age[Title == "Master."], na.rm = TRUE), 
                         TRUE ~ Age))

sum(is.na(data$Age))

## [1] 1

The data still contain one missing value. To find out

data[which(is.na(data$Age)), c("Name", "Title", "Age")]

##                       Name Title Age
## 767 Brewe,  Arthur Jackson   Dr.  NA

Let replace the missing value by the mean age of all rows where Title is Dr.

data$Age[data$Title == "Dr." & is.na(data$Age)] <- mean(data$Age[data$Title == "Dr."], na.rm = TRUE)

To check

sum(is.na(data$Age))

## [1] 0

6.Initial analysis

How many man and women in the data set

data %>% 
  group_by(Sex) %>% 
  count()

## # A tibble: 2 x 2
## # Groups:   Sex [2]
##   Sex        n
##   <fct>  <int>
## 1 female   314
## 2 male     577

The mean age of each group

data%>% 
group_by (Sex)%>%
summarise(avg = mean(Age, na.rm = TRUE))

## # A tibble: 2 x 2
##   Sex      avg
##   <fct>  <dbl>
## 1 female  27.6
## 2 male    30.9

How many people survived

data %>% 
  group_by(Survived, Sex) %>% 
  count()

## # A tibble: 4 x 3
## # Groups:   Survived, Sex [4]
##   Survived Sex        n
##   <fct>    <fct>  <int>
## 1 0        female    81
## 2 0        male     468
## 3 1        female   233
## 4 1        male     109

data %>% 
  filter(Title == c("Mrs.","Mr.", "Miss.")) %>% 
  group_by(Title, Survived) %>% 
  count(sort = TRUE)

## # A tibble: 6 x 3
## # Groups:   Title, Survived [6]
##   Title Survived     n
##   <chr> <fct>    <int>
## 1 Mr.   0          147
## 2 Miss. 1           40
## 3 Mr.   1           31
## 4 Mrs.  1           31
## 5 Miss. 0           16
## 6 Mrs.  0            9

data %>%
  group_by (Embarked)%>%
  count(sort= TRUE)

## # A tibble: 4 x 2
## # Groups:   Embarked [4]
##   Embarked     n
##   <fct>    <int>
## 1 S          644
## 2 C          168
## 3 Q           77
## 4 ""           2

data %>% 
  group_by (Cabin)%>%
  count(sort= TRUE) %>% 
  head(10)

## # A tibble: 10 x 2
## # Groups:   Cabin [10]
##    Cabin           n
##    <fct>       <int>
##  1 ""            687
##  2 B96 B98         4
##  3 C23 C25 C27     4
##  4 G6              4
##  5 C22 C26         3
##  6 D               3
##  7 E101            3
##  8 F2              3
##  9 F33             3
## 10 B18             2

data%>% 
  group_by (Pclass)%>%
  count(sort= TRUE)

## # A tibble: 3 x 2
## # Groups:   Pclass [3]
##   Pclass     n
##   <fct>  <int>
## 1 3        491
## 2 1        216
## 3 2        184

We can see here there is 687 rows that there is no cabin label We can remove this column

data <- data %>%
  select(-Cabin)

The final data(consistent data)

summary(data)

##   PassengerId        Name              Title               Sex     
##  Min.   :  1.0   Length:891         Length:891         female:314  
##  1st Qu.:223.5   Class :character   Class :character   male  :577  
##  Median :446.0   Mode  :character   Mode  :character               
##  Mean   :446.0                                                     
##  3rd Qu.:668.5                                                     
##  Max.   :891.0                                                     
##                                                                    
##       Age        Survived Pclass      SibSp           Parch       
##  Min.   : 0.42   0:549    1:216   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:21.77   1:342    2:184   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :30.00            3:491   Median :0.000   Median :0.0000  
##  Mean   :29.75                    Mean   :0.523   Mean   :0.3816  
##  3rd Qu.:35.90                    3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   :80.00                    Max.   :8.000   Max.   :6.0000  
##                                                                   
##       Ticket         Fare        Embarked
##  1601    :  7   Min.   :  0.00    :  2   
##  347082  :  7   1st Qu.:  7.91   C:168   
##  CA. 2343:  7   Median : 14.45   Q: 77   
##  3101295 :  6   Mean   : 32.20   S:644   
##  347088  :  6   3rd Qu.: 31.00           
##  CA 2144 :  6   Max.   :512.33           
##  (Other) :852