I recommend reading http://rpubs.com/vikassinha167/530654 before this article.
Once you have read it, we will explore the classic “Titanic” dataset.
# Uncomment the lines below if the packages are not installed
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("funModeling")
# Load the libraries
library(Hmisc)
library(funModeling)
library(tidyverse)
Let us check the structure of the Titanic dataset using two functions, str() and glimpse().
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
glimpse(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
The Titanic dataset is a table (a four-dimensional contingency table of counts), not a data frame, matrix, or plain vector.
class(Titanic)
## [1] "table"
print(Titanic)
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
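Because a table is an array with named dimensions, individual cells can be read by label and the counts can be summed directly. The following is a minimal sketch using only base R; the variable names are illustrative and not part of the original analysis.

```r
# Titanic ships with base R (package 'datasets'); no extra packages are needed.
dim(Titanic)              # size of each dimension
names(dimnames(Titanic))  # dimension names: Class, Sex, Age, Survived

# Read one cell by its labels: 3rd-class male children who did not survive
third_class_boys_lost <- Titanic["3rd", "Male", "Child", "No"]
third_class_boys_lost

# Every cell is a head count, so the grand total is the number of people recorded
total_people <- sum(Titanic)
total_people

# Collapse the table onto a single dimension (the 4th one is Survived)
survival_totals <- margin.table(Titanic, 4)
survival_totals
```

Indexing by label rather than by position keeps the code readable and avoids mistakes about the order of the dimensions.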
Let us convert this “table” to a more user-friendly “data frame”.
TitanicDataFrame <- as.data.frame(Titanic)
Now check the class and data within this data frame.
class(TitanicDataFrame)
## [1] "data.frame"
print(TitanicDataFrame)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
## 16 Crew Female Adult No 3
## 17 1st Male Child Yes 5
## 18 2nd Male Child Yes 11
## 19 3rd Male Child Yes 13
## 20 Crew Male Child Yes 0
## 21 1st Female Child Yes 1
## 22 2nd Female Child Yes 13
## 23 3rd Female Child Yes 14
## 24 Crew Female Child Yes 0
## 25 1st Male Adult Yes 57
## 26 2nd Male Adult Yes 14
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
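The conversion loses no information: each of the 32 rows is one cell of the original table, with its count in Freq. As a rough check, the sketch below rebuilds the table from the data frame with xtabs() and compares the cell counts; RebuiltTable is a name chosen here for illustration.

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Rebuild the 4-way table from the long format: Freq (left-hand side) is
# summed within each combination of the factors on the right-hand side
RebuiltTable <- xtabs(Freq ~ Class + Sex + Age + Survived, data = TitanicDataFrame)

# TRUE when every cell count matches the original table, in the same order
all(as.vector(RebuiltTable) == as.vector(Titanic))
```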
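As you can see above, the Titanic “table” was coerced to a “data frame” with one row per combination of Class, Sex, Age and Survived, plus a Freq column holding the count for that combination. Let us check the structure of this data frame.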
glimpse(TitanicDataFrame)
## Observations: 32
## Variables: 5
## $ Class <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3...
## $ Sex <fct> Male, Male, Male, Male, Female, Female, Female, Femal...
## $ Age <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
## $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...
## $ Freq <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, ...
str(TitanicDataFrame)
## 'data.frame': 32 obs. of 5 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
## $ Age : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq : num 0 0 35 0 0 0 17 0 118 154 ...
Let us check the status of the data, i.e. how many values are zero, NA, or infinite.
df_status(TitanicDataFrame)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 Class 0 0 0 0 0 0 factor 4
## 2 Sex 0 0 0 0 0 0 factor 2
## 3 Age 0 0 0 0 0 0 factor 2
## 4 Survived 0 0 0 0 0 0 factor 2
## 5 Freq 8 25 0 0 0 0 numeric 22
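The same checks can be reproduced in base R, which makes it clearer what the q_zeros, p_zeros, q_na and q_inf columns report. This is only a sketch of equivalent calculations, not the internals of df_status(); the variable names are illustrative.

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Number of missing values in each column
na_counts <- colSums(is.na(TitanicDataFrame))

# Zeros and infinite values only make sense for the numeric column
zero_count <- sum(TitanicDataFrame$Freq == 0)
zero_pct   <- 100 * mean(TitanicDataFrame$Freq == 0)
inf_count  <- sum(is.infinite(TitanicDataFrame$Freq))

na_counts
c(q_zeros = zero_count, p_zeros = zero_pct, q_inf = inf_count)
```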
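We are mostly concerned with NA and infinite values. Both counts are zero here, so the data needs no cleaning on that front. The 8 zeros in Freq are legitimate counts (combinations with no people in them), not missing data. Let us tabulate and plot the frequencies of our categorical variables.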
freq(TitanicDataFrame)
## Class frequency percentage cumulative_perc
## 1 1st 8 25 25
## 2 2nd 8 25 50
## 3 3rd 8 25 75
## 4 Crew 8 25 100
## Sex frequency percentage cumulative_perc
## 1 Male 16 50 50
## 2 Female 16 50 100
## Age frequency percentage cumulative_perc
## 1 Child 16 50 50
## 2 Adult 16 50 100
## Survived frequency percentage cumulative_perc
## 1 No 16 50 50
## 2 Yes 16 50 100
## [1] "Variables processed: Class, Sex, Age, Survived"
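Note that freq() counts rows of the aggregated data frame, so every level appears equally often (8 or 16 of the 32 combinations); these are not head counts. To see how many people fall into each level, the counts have to be weighted by Freq. A minimal base R sketch, with illustrative variable names:

```r
TitanicDataFrame <- as.data.frame(Titanic)

# Sum Freq within each level instead of counting rows
people_by_class    <- xtabs(Freq ~ Class, data = TitanicDataFrame)
people_by_survival <- xtabs(Freq ~ Survived, data = TitanicDataFrame)

people_by_class
people_by_survival

# Share of people in each class, in percent
round(100 * prop.table(people_by_class), 1)
```

With the tidyverse already loaded, dplyr's count() with its wt argument (for example count(TitanicDataFrame, Class, wt = Freq)) is an alternative way to get the same weighted totals.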
Let us profile Freq, the only numerical variable in the Titanic data set.
plot_num(TitanicDataFrame)
profiling_num(TitanicDataFrame)
## variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75
## 1 Freq 68.78125 135.9959 1.977224 0 0 0.75 13.5 77
## p_95 p_99 skewness kurtosis iqr range_98 range_80
## 1 279.75 582.27 3.224419 13.77958 76.25 [0, 582.27] [0, 152.6]
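Several of the columns above can be recomputed with base R, which helps when interpreting them. The sketch below assumes variation_coef is the standard deviation divided by the mean, which is consistent with the reported values; the variable names are illustrative.

```r
freq_values <- as.data.frame(Titanic)$Freq

freq_mean <- mean(freq_values)
freq_sd   <- sd(freq_values)   # sample standard deviation (n - 1 denominator)

# Coefficient of variation: spread relative to the mean
variation_coef <- freq_sd / freq_mean

# Quartiles and interquartile range (R's default quantile type)
freq_quartiles <- quantile(freq_values, probs = c(0.25, 0.50, 0.75))
freq_iqr       <- IQR(freq_values)

c(mean = freq_mean, std_dev = freq_sd, variation_coef = variation_coef)
freq_quartiles
freq_iqr
```

A mean (about 68.8) far above the median (13.5) is already a hint that the distribution has a long right tail.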
describe(TitanicDataFrame)
## TitanicDataFrame
##
## 5 Variables 32 Observations
## ---------------------------------------------------------------------------
## Class
## n missing distinct
## 32 0 4
##
## Value 1st 2nd 3rd Crew
## Frequency 8 8 8 8
## Proportion 0.25 0.25 0.25 0.25
## ---------------------------------------------------------------------------
## Sex
## n missing distinct
## 32 0 2
##
## Value Male Female
## Frequency 16 16
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## Age
## n missing distinct
## 32 0 2
##
## Value Child Adult
## Frequency 16 16
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## Survived
## n missing distinct
## 32 0 2
##
## Value No Yes
## Frequency 16 16
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
## Freq
## n missing distinct Info Mean Gmd .05 .10
## 32 0 22 0.984 68.78 106.4 0.00 0.00
## .25 .50 .75 .90 .95
## 0.75 13.50 77.00 152.60 279.75
##
## lowest : 0 1 3 4 5, highest: 140 154 192 387 670
## ---------------------------------------------------------------------------
Let us also plot the Freq variable and check its distribution.
ggplot(TitanicDataFrame, aes(x = Freq)) + geom_density()
# after_stat(density) replaces the deprecated ..density.. notation (needs ggplot2 3.3.0 or later)
ggplot(TitanicDataFrame, aes(x = Freq)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "brown")
We can also confirm the shape of this distribution with descriptive statistics.
dispersion <- profiling_num(TitanicDataFrame$Freq)
kurtosis <- dispersion$kurtosis
kurtosis
## [1] 13.77958
skewness <- dispersion$skewness
skewness
## [1] 3.224419
Both the plots and the statistics indicate that Freq is not normally distributed. The skewness of 3.22 (a normal distribution has 0) and the kurtosis of 13.78 (a normal distribution has 3) show that it is right-skewed with a heavy right tail.
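The skewness and kurtosis reported above can be reproduced from central moments, which makes the definitions explicit. This is a sketch assuming the moment-based definitions (dividing by n, kurtosis not reduced by 3); central_moment is a helper written here for illustration, not a funModeling function.

```r
freq_values <- as.data.frame(Titanic)$Freq

# k-th central moment, dividing by n (not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skewness_manual <- central_moment(freq_values, 3) / central_moment(freq_values, 2)^1.5

# Kurtosis: fourth central moment scaled by the squared variance
kurtosis_manual <- central_moment(freq_values, 4) / central_moment(freq_values, 2)^2

skewness_manual
kurtosis_manual
```

Under these definitions a normal distribution has skewness 0 and kurtosis 3, so both values for Freq sit far from the normal reference.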